MEASUREMENT 
AND EVALUATION 
IN PSYCHOLOGY 
AND EDUCATION 



Copyright, 1955 © 1961 by John Wiley & Sons, Inc.


All rights reserved. This book or any part thereof
must not be reproduced in any form without
the written permission of the publisher.

Library of Congress Catalog Card Number: 61'U494 
Printed in the United States of America 



Preface 


The reception that this book received in its first edition has been 
sufficiently favorable, and the comments that have come back to us
have been sufficiently kind, that we have tried to make this revision 
an evolution rather than a revolution. 

We have, of course, tried to make adequate reference to major new 
tests that have appeared in the last six years— form L-M of the 
Stanford-Binet, WAIS, STEP, and others. We have also tried to make
some reference to significant recent research, where it fits the pattern
of an introductory text. 

Beyond that, we have redone a number of sections from the first 
edition, sections with which we or our users had been less than
completely happy. Thus, the section on validity has been rather
completely restructured to represent our current thinking. A fuller
exposition has been developed on planning and blueprinting a test. We
have dealt more fully with the practical mechanics of a testing pro- 
gram. We have reorganized and added to the material on aptitude- 
test batteries. 

In two instances, we have changed the order of chapters into a
sequence that seems to us more teachable. However, whenever information
is presented in sequence, one feels the need for what is to
come later while one is discussing what has come before. Some who 
have used the book in classes have told us that they teach the chapters 
in various sequences different from that which appears in the book, 
and, fortunately, this seems to lead to no insuperable problems.

We continue in the basic belief that it is as important for students 
to learn what tests will not do as to learn what they will do; as
important to examine their own purposes and objectives for testing as to
examine the tests. It is in the hope of developing more restrained, 
discriminating, and insightful testers that we offer this book to our 
colleagues and students. 

Robert L. Thorndike 
Elizabeth Hagen

New York, N. Y.

April, 1961 



Contents 


CHAPTER 

1. Historical and Philosophical Orientation 

2. Overview of Measurement Methods 

3. The Teacher’s Own Tests 

4. Preparing Objective Tests 

5. Elementary Statistical Concepts 

6. Norms and Units for Measurement

7. Qualities Desired in Any Measurement Procedure 

8. Where to Find Information about Specific Tests

9. Standardized Tests of Intelligence or Scholastic 
Aptitude 

10. The Measurement of Special Aptitudes 

11. Achievement Tests

12. Questionnaires and Inventories for Self-Appraisal 

13. The Individual as Others See Him 

14. Behavioral Measures of Personality 

15. Projective Tests 

16. Planning a School Testing Program 



17. Marking and Reporting

18. Measurement in Educational and Vocational
Guidance

19. Tests in the Selection of Personnel


APPENDIX

I. Computation of Square Root

II. Calculating the Correlation Coefficient

III. Section A  General Intelligence Tests
     Section B  Aptitude Test Batteries
     Section C  Reading Tests
     Section D  Elementary-School Achievement Batteries
     Section E  High-School Achievement Batteries
     Section F  Interest Inventories
     Section G  Adjustment and Temperament Inventories

IV. Sources for Educational and Psychological Tests

Index



Chapter 1 


Historical and Philosophical
Orientation 

HISTORICAL BACKGROUND

The roots of the measurement of man lie in antiquity. We must 
believe that even in prehistoric times Og, the cave man, made rudi- 
mentary appraisals of his fellows. He saw Zog go by, made some 
such judgment as “Big, strong, keep out of way,” and acted upon
it; or he came upon the campfire of Wog, observed “Small, weak, 
take dinner,” and did so forthwith. But for much of recorded his- 
tory, the appraisals that man has made of his fellows have been of 
this crude subjective type. 

He who seeks imaginatively can find suggestions of more systematic 
and refined methods. Thus, the tournaments of the days of chivalry 
can be thought of as an effort to arrange men in an order from best 
to worst in feats of arms, and contests leading to the crowning of 
“champions” have always constituted a rough sort of measurement.
Teachers have always catechized their pupils to appraise their degree 
of mastery of the tasks assigned them, evaluating them as best they 
could by their responses. But these approaches were more primitive 
than the sun dial and the ox cart. They are characteristic of the
appraisal of man and his behavior up to the present century. Application
of the quantitative methods of science to psychology and education
is very new. In 1850 there was almost none of it; 1900 was still
a pioneering period. 

EARLY EDUCATIONAL TESTING

The appraisal of educational achievement in the United States be- 
fore 1850 had relied very largely upon oral examination. The teacher 
or visiting examiner asked a question. The designated pupil undertook
to answer it. The questioner arrived at an immediate subjective
evaluation of the answer. There was uniformity neither in the questions
asked different pupils nor in the evaluation of their replies. The
method was burdensome and inefficient, since only one pupil could be 
tested at a time. It provided no comparability from pupil to pupil 
either in the task or in the evaluation of it. 

During the latter half of the nineteenth century, oral examinations by 
boards of visitors were replaced by set written examinations as a basis 
for promotion or admission to an academy or college. Outside exam- 
ination in turn yielded to evaluation by the classroom teacher. 
Whether carried out by an outside examiner or by a teacher, however, 
the technique was that of the essay examination, in which a pupil re- 
sponded in his own words to a question set by the examiner. 

The written examination had advantages over the oral examination 
of (1) presenting the same tasks to each member of the group and (2)
letting each pupil work for the full examination period. However, 
though the task was made uniform, at least for the members of a given 
class, appraisal of each individual’s response to the task remained 
highly subjective, depending upon the standards and prejudices of the 
particular scorer. As we shall see in Chapter 3, great variations were
found in the scoring of a particular paper. Only since 1900 has there 
been any general development of objectively scored tests in which a 
pre-established key can be routinely and uniformly applied to the 
responses made by each pupil. Only since 1900 has the idea emerged
of a general standard of performance for an age or grade, with which 
the performance by any class or any individual may be compared. 


THE BEGINNINGS OF PSYCHOLOGICAL MEASUREMENT

Psychology in 1850 was still in large measure a part of philosophy.
Courses dealing with man and his actions were presented under the
[...] armchair fashion the [...] was almost entirely
non-experimental, and the idea that one could measure in quantitative
terms the speed of responding, the amount of forgetting, or the
level of intelligence would have been received in most quarters with
hostility or, more probably, ignored as not worthy of rebuttal. The
nearest approaches to psychological measurement were a few scattered
[...] and the speed of simple [...]




alliances with the biological sciences. It had adopted the experimental
method and was measurement-conscious. The basic tool of
experimentation is measurement, and psychology was expanding its
measurement techniques in all directions. The record since 1900 is
the record of the attempt to expand and adapt measurement techniques
to cover all aspects of human behavior.

Three main streams combined to yield the vigorous measurement
movement in psychology and its spread through education. Some of
the flavor and some of the emphasis have come from each stream.
These were (1) the physiological and experimental psychology that
had its main growth in Germany in the nineteenth century, (2) Darwinian
biology, and (3) the clinical concern for the maladjusted and
underdeveloped individual.

BEGINNINGS OF EXPERIMENTAL PSYCHOLOGY

The modern, scientific era was first ushered into the physical sciences
in the seventeenth and eighteenth centuries. Scientific interest and 
method soon spread over to the biological sciences, and by the early 
nineteenth century experimental physiology was a center of active research
interest in the experimental laboratories in Germany and other
European countries. Experimental physiologists became interested in 
the operation of the senses, studying intensively seeing, hearing, and 
the other senses. Physiologists also became interested in measuring 
the speed of simple motor responses. 

In 1879 the first laboratory for experimental psychology was estab- 
lished by Wilhelm Wundt at Leipzig. Early experimental psychologists 
were interested in many of the same measurements that had concerned 
the physiologists. These were measures of seeing, hearing, feeling, 
and speed of response. But gradually they extended their concern 
to more clearly psychological matters, such as measurement of perceptual
span (the amount that the individual can “take in” at once), of
rate of learning, of the timing of complex mental tasks, and so forth.

One area of particular interest for its contribution to the broad
field of psychological and educational measurement was that known 
as psychophysics. The experimental psychologist was much interested 
in exploring the relationship between physical stimulus intensities, e.g.,
of light wave or of sound wave, and the experienced intensity of the 
resulting sensation. The designing of effective experimental proce- 
dures for studying these problems gave rise to a set of techniques that 
have proved adaptable to a wide range of problems of psychological 
measurement. 




From experimental psychology came a legacy of respect for careful
experimental method and precision of technique, a number of experimental
designs, and statistical techniques that could be carried
over to more general psychological and educational measurement 
problems. 

EARLY STUDY OF INDIVIDUAL DIFFERENCES

A second stream contributing to psychological measurement was
Darwinian biology. In 1859 Darwin brought out his Origin of 
Species. The basic concern in Darwin’s work was with variation 
among the members of a species, that is, individual differences. Dar- 
win's work was followed up in England and applied to distinctively 
human affairs, particularly by Sir Francis Galton. Whereas German
psychology had focused on finding the general facts true of all people,
Galton became interested primarily in the differences among people.
Stimulated by Darwin to study the inheritance of traits, he gathered
data both on physical and on psychological characteristics. The study
of these individual differences required better statistical tools, and the
British group, under the leadership of Karl Pearson, developed improved
techniques for analyzing and describing the patterns of individual
differences.

These, then, were the two main contributions of the British group 
to the growth of psychological measurement: a deep concern for 
studying the differences among people as interesting and significant 
facts and appropriate statistical techniques and tools for carrying out 
this study. 

CLINICAL STUDY OF DEVIATES

During this same period, a third stream was gathering strength. 
This was concern for the individual who was not functioning success- 
fully. Humanitarian concern for the insane, the feeble-minded, and 
the general misfit led in the nineteenth century to active research and 
investigation aimed toward understanding their condition and improv- 
ing their lot. This clinical interest in the maladjusted individual was 
particularly strong in France, and it was here that it bore fruit for the 
field of measurement. As psychologists worked with these unfortunate
deviates, the need became more and more apparent for some
uniform way of expressing the degree of their defect, particularly in 
the mental sphere. It was in this context of concern for the child 
who was not getting along in school that Binet and his colleagues developed
the series of intellectual tasks that ultimately grew into the
whole array of measures of intelligence.




SYNTHESIS IN THE UNITED STATES 

By the early years of the present century, all these streams of influence
had made themselves felt in the United States. James McKeen
Cattell had taken his graduate work in psychology in Germany with
Wundt, where he had received a good grounding in quantitative and
experimental psychology. But he had also been exposed to the work
of Galton and had developed a lasting interest in individual differences
and statistical method. When he returned to the United States, he 
began an investigation of individual differences in the simple sensory 
and motor performances that were being measured in German psycho- 
logical laboratories. He studied the relationship between these per- 
formances and academic success. 

E. L. Thorndike was a student of Cattell’s just before the turn of
the century and became a focal influence in the spread and develop- 
ment of standardized educational tests. Both his own work and that 
of a large group of students rapidly spread the gospel of objective 
measurement in education. 

The work of Binet was eagerly seized upon in this country. His
tests were translated and produced in several versions, of which by
far the most influential became the Stanford-Binet first produced by
Lewis Terman in 1916. The testing movement seemed especially
suited to the temper of this country and took hold here with a vigor
and enthusiasm unequaled elsewhere.

MEASUREMENT IN THE TWENTIETH CENTURY

The first 60 years of the twentieth century may conveniently be 
divided into four equal parts, so far as the recent history of psycho- 
logical and educational measurement is concerned. We may desig- 
nate the period from 1900 to 1915 the pioneering phase. This was 
the period of exploration and initial development of methods. It saw 
the emergence of the first Binet intelligence scales and their American
revisions. Standardized achievement tests in different subjects began
to appear, exemplified by Stone’s arithmetic tests, Buckingham’s spelling
tests, and Trabue's language tests. Thorndike developed his first
handwriting scale. Otis and others were initiating work on group
tests of intelligence.

The next 15 years, 1915 to 1930, can perhaps be called the “boom”
period in test development. The pioneers had shown the way, and
in the hands of enthusiastic followers tests multiplied like rabbits.
Standardized tests were developed for all the school skills and for the
content areas of the school program. Achievement batteries made
their appearance. Starting with Army Alpha of World War I, group




intelligence tests were produced in great numbers. Also starting with 
a wartime product, the Woodworth Personal Data Sheet, a whole line 
of personality questionnaires and inventories came into being. 

The rapid development of testing instruments and methods was 
pushed by a group of enthusiasts. They were converts who had 
“gotten the word.” Their enthusiasm was contagious and extended 
not only to the production of tests but also to their use. Tests of 
intelligence and achievement were administered widely and somewhat 
indiscriminately. Test results were often accepted unhesitatingly and 
uncritically and served as the basis for a variety of unjustified judg- 
ments and actions with respect to individuals. In the expansive flood 
of enthusiasm for objective measurement, some enthusiasts were not
inclined to be critical of their instruments or the interpretation of re- 
sults from them. Many sins were committed in the name of measure- 
ment by uncritical test users. 

After a while the pendulum began to swing back. More and more 
sharply voiced criticisms of tests and of the uses made of tests began 
to be heard. Heredity-environment discussions became acrimonious. 
The use of test scores as a basis for classroom grouping became the 
subject of bitter attack. Criticism was directed at specific tests in 
terms of their limited scope and their emphasis upon restricted and 
traditional objectives. It was also directed at the whole underlying 
philosophy of quantification and the use of numbers to express psycho- 
logical qualities. 

The critical attack had the healthy effect of forcing the test enthusiasts
themselves to become more critical of their assumptions and
procedures and to broaden their approach to the whole problem of 
psychological and educational appraisal. From about 1930 to 1945 
may be considered a period of critical appraisal, of taking stock, of 
broadening techniques and delimiting interpretations. It was a period 
in which the center of attention shifted from “measuring” a limited 
range of academic skills to “evaluating” achievement of the whole 
range of educational objectives. It was a period in which the holistic, 
global projective methods of personality appraisal came to the fore. 

It is difficult to view with any perspective at all events that have 
taken place within the last 15 years. History may eventually characterize
the period quite differently than do we, standing so close to it.
However, we will venture to predict that the period from 1945 to 
1960 will be characterized as the period of test batteries and testing 
programs. Partly as a result of their successful use in World War II, 
integrated aptitude batteries for educational and personnel use have 
multiplied during this period. And the large-scale testing programs,
such as those administered by the College Entrance Examination
Board, though stemming from much earlier in the century, have expanded
in size and multiplied in numbers at a striking rate. We have
experienced a second boom period, not so much in test development
and construction, as in test administration and use. The mid-twentieth
century is a period in which standardized testing is a widely
experienced and widely accepted phenomenon of our American 
culture. 


Under these circumstances it is particularly important that con- 
struction, use, and interpretation of these instruments be well under- 
stood by teachers, guidance workers, and psychologists for whom they 
are daily tools of the trade. It is also important that the phenomenon
of standardized testing be understood by the citizens who are exposed
to it in their search for employment for themselves or education for
their children. Therefore, let us try at this point to formulate a philosophy
of measurement that will take into account the lessons of the last
60 years, and will serve to guide our attack on measurement problems 
and our use of measurement techniques in the years ahead. 


PHILOSOPHICAL ORIENTATION

In education and in psychology we are concerned with human 
beings. Sometimes we are concerned with them as specific individ- 
uals, as when we want to know why Mary is having so much difficulty 
in learning to do long division. Sometimes we are concerned with 
them as specific groups of individuals, as when we inquire whether 
the children in class A can read as well as those in class B. Some- 
times we are concerned with them as general representatives of man- 
kind, as when we try to determine whether children with high verbal 
intelligence tend to show more or less signs of emotional disturbance 
than children of average intellectual ability. 

KNOWLEDGE AS A GUIDE TO ACTION

In practically all of education and in much of psychology, our concern
about individuals is to do something about them, individually
or collectively. In so far as education is a science, it is an applied
science, and in psychology, too, the applied aspects bulk large in the
present scene. The educator or the practical psychologist is continually
faced with the necessity of arriving at some decision as to a
course of action. He must decide what to do about an individual
or individuals, or he must help the person himself decide what to do.
He must decide in which grade to place a child or what special instruction
to provide for him. He must reach a diagnosis of a child
with a reading disability, with a view to recommending treatment. He
must recommend whether or not to employ a job applicant. He must
help a student decide whether to plan for college and, if so, what sort
of program to take and what type of job to aim for. The educator or
psychologist wants each one of these decisions to be a sound and well-conceived
one.

Our basic assumption is that sound decisions arise out of relevant
knowledge of the individual or individuals. We assume that the
more we know about a person that relates to our present decision,
and the more accurately we know it, the more likely we are to arrive
at a sound decision about him or a wise plan of action for him. By
the same token, we assume that the more relevant and accurate information
we can provide the individual about himself, the more
likely he is to arrive at a sound decision on his own problem. It may
be necessary for us to qualify this assumption as we proceed. There
may be limits on the amount and kind of information that can be used
in a particular situation. We shall indicate that knowledge in and of
itself is not wisdom. But in its general form the assumption is basic
not only to educational and psychological measurement but also to all
science. We assume basically that knowledge is good, that knowledge
is power, that knowledge is the basis for effective control of the problems
that confront us from day to day. This is a basic tenet of our
faith.

What does it mean to “know an individual”? Fundamentally, to
know an individual means to be able to describe him accurately and
fully. If we know John Jones well, we can describe not only how he
looks (how tall he is and how heavy, the color of his hair and eyes,
the birthmark under his chin). Much more importantly, we can describe
what he can and will do: how he will dress, what he is likely
to talk about, what he will be interested in, what types of tasks he can
do and how well he can do them, how he will respond to the different
stresses and strains of life. To know a person completely means to be
able to describe him completely, to predict how he will behave in
every possible situation. Obviously, we are far, far away from this
objective, and we always will be. The function of educational and
psychological measurement is to move us a little closer to it.

IMPORTANCE OF MEASURING THE RIGHT THING

The effectiveness of our description of any object or person depends
upon two things. It depends (1) upon how wisely we have chosen
[...] (2) upon how fully and accurately
we have managed to describe each one.

A description may fail to be useful for the need at hand because
we choose irrelevant features to describe. Thus, in describing a painting
we might report its height, its breadth, and its weight. We might
report these with great precision. If our concern were to crate the
picture for shipment, these might be just the items of information we
would need. On the other hand, if our purpose was that of characterizing
the painting as a work of art, our description would be worthless.
The attributes of the picture we had described would be essentially
irrelevant to its quality as a work of art.

Similarly, a description of a person may be of little value for our
purposes if we choose the wrong things to describe. Thus, the Air
Force in selecting pilots to fly jet fighters might get very accurate information
on height, weight, years of education, size of vocabulary,
and speed of reading for all its applicants. It would almost surely
find, however, that none of these things helped at all in selecting the
men who could successfully learn to fly the planes. Such factors as
these are in large measure irrelevant to flying success, which appears
to depend more on mechanical know-how and on motor coordination.

Again, a high school concerned about assessing the level of literary
appreciation in its pupils might prepare a test inquiring exhaustively
into the names of the characters and the details of the plot of Shakespeare's
Julius Caesar. The worthlessness of this procedure may be
less obvious, but is probably just as real as that proposed for the
selection of pilots. This test seems useless for the task at hand, because
detailed factual knowledge about an isolated literary work is
no indicator of the quality of a pupil’s literary appreciation. The test
has asked the wrong kinds of questions. The evidence it provides is
related to a faulty interpretation of the original question that was asked.

The first, and perhaps the most important, step in any project for
educational or psychological measurement is defining just what it is
that we wish to measure and determining what operations will serve
to measure it. Educational objectives are likely to be incompletely
formulated and expressed in vague terms. The concepts must be
clarified and made more specific before we can make much progress
toward sensible procedures of measurement. Until we can decide
what is meant by “good citizenship” or what behaviors are exhibited
by a person who shows “understanding of scientific method,” we have
little prospect of developing procedures to appraise either the one or
the other.





THE NEED FOR PRECISION 

Our description may be of limited value in the second place because 
the attributes we elect to describe are described inaccurately. Thus,
if our description of the painting were expressed in terms of theme, 
composition, line and volume, and color values, it would certainly be 
a good deal more to the point as an appraisal of a work of art. But 
it would be much more wordy, more subjective, and less precise than 
our previous description of length, breadth, and weight. Different 
persons could be expected to differ markedly in the qualities they saw 
and the terms they used to describe them. This might be true to such 
an extent that a single individual’s description would give us only a 
very rough, unclear, and undependable impression of the picture as 
a work of art. 

As for the candidate for pilot training, we might get ratings from 
his friends on his speed of learning new coordinations, ability to pay 
attention to many things at once, and resistance to disturbance by 
emotional stress. We may hazard a guess that these ratings would
again prove ineffective in predicting pilot success, not so much because
the qualities themselves are unimportant, but because we are
not skillful in observing such qualities in our fellows or in expressing
our observations in exact quantitative form.

Our high school, concerned with literary appreciation, might ask 
each pupil to write a report on some book he had read recently, tell- 
ing what he had liked about it, and why he had or had not thought 
it was a good book. Again, we may feel that such a report would 
provide information more related to appreciation than would a test 
of factual knowledge. But judging the quality of appreciation shown 
in a varied collection of compositions about an assortment of different
books would be a very subjective enterprise, and the judgments
would tend to be quite undependable. Each judge would have his
own personal standards of what constituted good literary apprecia- 
tion. He would make his judgments in terms of those personal stand- 
ards. There would be little agreement from one judge to the next 
as to who had shown good appreciation and who had shown poor. 
Our appraisal would be unsatisfactory because it would be inaccurate. 


DEGREES OF REFINEMENT IN MEASURING

There is enormous variation from one trait to another in the degree 
of refinement we have been able to achieve in describing it. At the
crudest level, our appraisal may come to no more than a simple two-way
classification. This may take the form present-absent: e.g.,
John lisps but Bill does not lisp; or the form trait-opposite: e.g.,
John runs fast but Bill runs slowly.

A somewhat more refined level of description is achieved when we 
characterize the trait by a set of adjectives which represent degrees 
of the trait; e.g., John runs fast, Joe goes like a streak, Jack runs
fairly fast, Will goes like molasses in January. But the number of
such qualitative descriptions is limited, and the meaning of such adjec- 
tives or similes is far from uniform from person to person. 

A still further level of refinement in description is reached when
we can arrange the members of a group in rank order with respect 
to an attribute and when we can locate any individual on such a rank 
order. Thus, we may say Joe runs the fastest, John runs faster than 
Jack, Jack runs faster than Will, and Will runs the slowest. Such a 
procedure of ranking could theoretically be extended to include all 
the children in a class, in a school, or even all the children in the whole 
country. Clearly, when we can appraise a trait well enough to produce
such a ranking, a very great increase in the adequacy of our description
has been achieved.

Finally, some attributes may be expressed in a quantitative state- 
ment of amount. Thus, we may be able to report that Joe ran 100 
yards in 10 seconds, John in 14 seconds, Jack in 15 seconds, and Will 
in 17 seconds. This last is clearly the most precise type of statement 
of the essential facts and the one that makes us best able to decide upon
appropriate action with regard to an individual, so far as that action 
depends upon speed of running 100 yards. It is certainly the type 
that the track coach would want to have before deciding whom to 
keep on the track team. 

We have identified four points along a scale of quantification and 
precision of measurement. 

1. Either-Or. A pupil is either a boy or a girl. A man is single,
married, widowed, or divorced. A student is enrolled in the college
preparatory, commercial, or general curriculum.

2. Qualitatively Described Degrees. Thus, a pupil may show “normal
speech,” “slight stuttering,” “stammer,” “marked stutter.” Or
the pupils in a class may be characterized as “quiet and relaxed,”
“slightly fidgety,” or “tense and restless.”

3. Rank in a Group. Thus, a series of graded tasks scored by
uniform standards enables us to find who does best and who does 
worst on reading comprehension, arithmetic problems, or spelling. 
The rest of the group can be arranged in order from best to worst. 




4. Amount, Expressed in Uniform Established Units. A boy
weighs 56 pounds, is 45 inches tall, is 6½ years old.
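
The relation among these four levels can be illustrated with the running times given earlier for Joe, John, Jack, and Will. The short Python sketch below is offered only as an illustration: it starts from the times in seconds (level 4) and derives the three coarser descriptions from them. The function names and the cutoff values of 12, 15, and 16 seconds are hypothetical choices made for this example and do not come from the text.

```python
# Times in seconds for the 100-yard run, taken from the example in the text.
times = {"Joe": 10, "John": 14, "Jack": 15, "Will": 17}


def rank_order(scores):
    """Level 3: names ordered from fastest (smallest time) to slowest."""
    return sorted(scores, key=scores.get)


def qualitative_degree(seconds):
    """Level 2: a verbal label. The cutoffs of 12 and 16 seconds are hypothetical."""
    if seconds < 12:
        return "very fast"
    if seconds < 16:
        return "fairly fast"
    return "slow"


def either_or(seconds, cutoff=15):
    """Level 1: a two-way classification. The 15-second cutoff is hypothetical."""
    return "fast" if seconds < cutoff else "slow"


ranking = rank_order(times)
labels = {name: qualitative_degree(t) for name, t in times.items()}
classes = {name: either_or(t) for name, t in times.items()}

print(ranking)
print(labels)
print(classes)
```

Each coarser description can be computed from the amounts, but the reverse is not possible: knowing only that John is "fairly fast" or that he ranks second does not tell us that he ran the distance in 14 seconds.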

This wide variation in the refinement of our appraisals must be 
frankly admitted. Some traits we may never be able to express more 
accurately than by a “very” and “not very” characterization. Our
failure to have achieved greater refinement in measuring these traits
is probably partly due to lack of clarity and sharpness in definition of 
the attribute that we propose to describe. When we characterize a 
person as sincere, cultured, socially adjusted, cooperative, a good citi- 
zen, our hearer may have only a very general idea of what we mean. 
(And, as a matter of fact, so may we.) In part our failure is certainly
due to the limited ingenuity and skill we have shown to date in finding 
ways to represent degree or amount of the attribute with precision. 

It may sometimes be partly due to the essential nature of a particular 
attribute, which makes it fundamentally not expressible in quantitative 
form. There may be some things that, in their very nature, can never 
be quantified. 

Certainly, our present ability truly to measure many of the attri- 
butes of persons that appear to be relevant and important for making 
decisions about them and planning actions with respect to them leaves 
much to be desired. However, while recognizing this fact we must 
also appreciate that enormous strides have been made since 1900 
toward more objective and more accurate appraisals of human beings. 
The fact that we are limited in some directions does not lessen the 
value of increased precision wherever such increased precision has 
been achieved. While keeping a critical eye upon the limitations of 
measurement procedures, we should still use them for all they are worth 
in increasing the accuracy of our information about students, em- 
ployees, or clients. 

CRITICISMS OF PSYCHOLOGICAL AND EDUCATIONAL MEASUREMENT
Since about 1930, psychological and, particularly, educational 
measurement have come in for a good deal of criticism. The educational
philosophers have been especially outspoken in expressing their
dissatisfaction. In part, the criticisms have been directed at the 
basic logic of psychological measurement. These criticisms have been
directed at the limitations we have just been discussing, as well as at 
some other problems concerning the equivalence of units and scores, 
which we shall consider briefly in a later chapter. In part, however,
the criticisms have been directed at the effects that the measurement



procedures have had upon school practice. The following types of
criticisms have been made:

1. Standardized measurement procedures have been said to foster
undemocratic practices and attitudes in the classroom. Forming 
homogeneous class groups on the basis of an intelligence or achieve- 
ment test is a specific practice that has been the target of this criticism. 

2. It has been contended that standardized tests have had the effect 
of freezing the curriculum and of preventing experiment and change, 
on the grounds that the commercial standardized test typically lagged
behind the advance of educational thought and practice. 

3. The limited scope of many standardized tests has been pointed 
out, and it has been indicated that they fail to appraise many of the 
changes in children that schools should be interested in producing. 

4. The short-answer test items have been accused of producing 
undesirable study habits directed toward piecemeal memorization
rather than understanding. 

There has been at least a germ of truth in all these criticisms. Some 
of them we shall consider in more detail in later chapters. As the
criticisms are examined, however, we find that they are not primarily 
criticisms of obtaining more information about the individual. They 
are criticisms either of (a) the incompleteness and imperfection of the 
information yielded by our measurement procedures or (b) the unwise 
things that we do with that information. It is as though we condemned 
the physicist because (a) he cannot yet control the weather, and (b) 
his knowledge has led to the construction of atom bombs which may 
destroy mankind. We must grant that our measurement procedures 
are not complete and our actions based on them are not always wise.
But the remedy lies in developing better measurement procedures that 
will give us more complete and more accurate information about the 
individual. It lies in gaining better understanding of our measures— 
their strengths and their weaknesses — so that we may use them with
more wisdom. It does not lie in getting less information. 

It cannot be too much emphasized that measurement at best pro- 
vides only information, not judgment. A test will yield only a score, 
not the conclusion to be drawn from that score. The information 
provided in a test score is not a substitute for insight. This informa- 
tion is the raw material with which insight must work, in the clinic, 
in the classroom, and in the research laboratory. Experience, training,
and basic sagacity must provide the insight that will take a set
of data about an individual or group, know how much faith to place 



in them and what meaning to give them, and draw from them a sound
conclusion or plan for action.

Furthermore, it should be emphasized that the information that any 
measurement procedure gives is limited. It is limited by the nature of 
the measurement instrument itself. The typical intelligence test, for 
example, samples certain types of performances with abstract ideas expressed
in symbolic form. It is not a measure of the general worth of
the individual, of his ability to acquire mechanical skills or artistic tech- 
niques, or of his integrity and dependability as a member of society. 
The information is limited by the conditions under which the procedure
is applied. Thus, an intelligence test given to an emotionally
disturbed and resistant child may give a very inadequate picture of 
what that same child could do if the disturbing influences were re- 
moved and the resistance overcome. Learning to use measurement 
results wisely is in part learning what information a particular device 
does and does not provide and in part learning under what circumstances
that information is likely to be trustworthy. Throughout this
book there will be recurring attempts to guide that learning. 

SUMMARY STATEMENT 

We can summarize much of the foregoing discussion on a working 
philosophy of measurement in the following four points. 

1. The process of measurement is secondary to that of defining 
objectives. The ends to be achieved must first be formulated clearly. 
Then measurement procedures can be sought as tools for appraising 
the extent to which those ends have been achieved. 

2. Much of educational and psychological measurement is, and will 
probably remain, at a relatively low level of precision. We must 
recognize this fact, using the best procedures available to us, but 
always treating the resulting score as a tentative hypothesis rather 
than as an established conclusion. 

3. The more elegant procedures of formal test and measurement 
must be supplemented by the cruder procedures of informal observation,
anecdotal description, and rating if we are to obtain a description
of the individual that is usefully complete and comprehensive.

4. No amount of ingenuity in developing improved procedures for 
measuring and appraising the individual will ever eliminate the need 
to interpret the results from those procedures. Measurement procedures
are only tools. Insight and skill are required in the use of such
tools. The sharper and more varied the tools, the more skill it takes
to use them most effectively.


SUGGESTED ADDITIONAL READING

Boring, Edwin G., H. S. Langfeld, and H. P. Weld, Foundations of psychology,
New York, Wiley, 1948, Chapter 15.

Cottle, William C., and N. M. Downie, Procedures and preparation for
counseling, Englewood Cliffs, N. J., Prentice-Hall, 1960, pp. 158-162;
165-167; 174-175; 180-183.

Goodenough, Florence L., Mental testing, New York, Rinehart, 1949, pp.

Harris, Chester W., Editor, Encyclopedia of educational research, 3rd ed.,
New York, Macmillan, 1960, pp. 807-816; 1502-1503.

Lorge, Irving, The fundamental nature of measurement, Chapter 14 in
E. F. Lindquist, Editor, Educational measurement, Washington, D. C.,
American Council on Education, 1951.

Murphy, G., An historical introduction to modern psychology, rev. ed.,
New York, Harcourt, Brace, 1949, Chapters 6, 8, 11, 24, and 26.

Nunnally, Jum C., Tests and measurements, New York, McGraw-Hill,
1959, Chapters 1 and 2.

Seward, Georgene S., and John P. Seward, Current psychological issues,
New York, Holt, 1958, Chapter II.

Wrightstone, J. Wayne, et al., Educational measurements, Rev. educ. Res.,
20, 1956, 268-291.

Wrightstone, J. Wayne, Joseph Justman, and Irving Robbins, Evaluation
in modern education, New York, American Book, 1956, Chapter 1.


QUESTIONS FOR DISCUSSION 


1. The development of objective and standardized tests has proceeded 
faster and further in the United States than in any other country. What
factors do you see as contributing to this?

2. Try to talk to a student from some foreign country and find out what 
examinations are like and how they are used in his country. What differences
do you find, as compared with the United States? What are the advantages
and disadvantages of each system?

3. In many graduate schools oral examinations are still used in examining
candidates for higher degrees. What are the advantages and disadvantages
of this type of examination?

4. From your reading or from your personal experience, give one or 
more concrete examples of the misuse or misinterpretation of the results
from standardized tests.

5. How universally acceptable is the statement “knowledge is good” in
the field of education and applied psychology? What objections would
you have to this statement, or what limitations would you place upon it?




6. Give an illustration of a measuring procedure in education or psychology
that would be of little or no value because it was not sufficiently
precise; one that would be of no value because it was measuring the wrong 
thing. 

7. Give two examples of educational or psychological measures to rep- 
resent each of the following four points along the scale of quantification 
and precision of measurement: (a) either — or, (b) qualitatively described
degrees, (c) rank in a group, (d) amount, expressed in uniform, estab- 
lished units. 

8. Your textbook states that “to know an individual means to be able 
to describe him accurately and fully." What would be central in such a 
description for 

a. A fourth-grade girl having difficulty with arithmetic. 

b. An eighth-grade boy who has been picked up for throwing rocks 
through the school windows. 

c. A recent high-school graduate who is being considered for a job as
receptionist. 



Chapter 2 

Overview of Measurement 
Methods 


During the present century techniques for appraising the individual 
have been developed in great variety, and they have been applied to 
many aspects of his abilities and personality. Specific techniques will 
be discussed in detail in later chapters. The present chapter is de- 
voted to a general overview, mapping out some of the main landmarks 
of the whole domain. 

APPRAISAL BY TESTS VERSUS APPRAISAL BY
OBSERVATION IN NATURAL SITUATIONS

Attempts to appraise and describe a person can be grouped into
two main categories: those that depend upon setting up special test 
situations and those that depend upon observing behavior in the actual 
naturally occurring situations of life. The usual earmarks of a test 
are that (1) it occurs at a specified time and place, (2) it consists 
of a set of tasks uniform for each person tested, and (3) it is seen 
as a test situation by the person being appraised. By contrast, evaluation
based upon the naturally occurring situations of life is likely to
(1) extend over an indefinite period, (2) be based upon situations 
that vary from person to person, and (3) not be perceived as a test 
by the person being appraised. The distinction between test situations 
and natural life situations is not an entirely sharp and clear-cut one, 
and we will have occasion to consider some in-between cases. However,
it is usually clear whether we are dealing with a test as such or
with observations under the natural conditions of life. 
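
The three earmarks can be read as a simple checklist. The Python sketch below counts how many of them a given procedure shows. The class name, the field names, and the playground example are hypothetical, and the count is only a way of picturing the in-between cases, not a scoring rule proposed in the text.

```python
from dataclasses import dataclass


@dataclass
class Procedure:
    # Hypothetical record of an appraisal procedure; field names are assumptions.
    name: str
    fixed_time_and_place: bool  # (1) occurs at a specified time and place
    uniform_tasks: bool         # (2) same set of tasks for each person tested
    seen_as_test: bool          # (3) perceived as a test by the person appraised


def earmarks(p: Procedure) -> int:
    """Count how many of the three usual earmarks of a test are present."""
    return sum([p.fixed_time_and_place, p.uniform_tasks, p.seen_as_test])


arithmetic_test = Procedure("arithmetic test", True, True, True)
playground_notes = Procedure("playground observation", False, False, False)

for p in (arithmetic_test, playground_notes):
    print(p.name, earmarks(p))
```

A procedure showing all three earmarks is plainly a test and one showing none is plainly an observation of natural behavior; a count of one or two marks the in-between cases mentioned above.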

In thinking about the evaluation and measurement of man, we are 
likely to think primarily of tests narrowly defined, a test of arithmetic, 
a test of scholastic aptitude, or a test of auditory acuity. But we must 
remember that many of the important appraisals we make of people 
have always been, and will continue to be, based on observations of
them as they live from day to day. Appraisals of the nursery-school 
child’s insecurity in relation to other children, of the 10-year-old’s
cooperativeness, or of the junior executive’s initiative will almost nec- 
essarily be based upon observations of him over a period of time as 
he functions in his natural social group. Evaluations based on these 
observations have serious limitations. We are likely to find little uni- 
formity from person to person in either the situations observed or the 
standards of judgment of the observers. But for some kinds of be- 
havior we have no adequate tests to substitute for observations of 
natural situations — and very likely never will have. 

Any complete picture of evaluation procedures must, therefore, pay 
attention both to test techniques and to devices for improving the ob- 
servation of naturally occurring behavior. We will tend to prefer test 
situations where suitable ones can be devised. The examiner has more 
control over the situation, since he can present the same tasks or ques- 
tions to everyone in the same way. He can usually get more precise 
results from a test and results that depend less upon the particular 
person making the appraisal. However, we must recognize that many 
significant aspects of individual behavior, by their very nature, defy 
reduction to a neat test. These can be appraised validly only as the 
individual functions in a natural life situation. 

Of course, not all tests are perfectly frank and aboveboard. We
shall have occasion to consider various types of test instruments in 
which the characteristics appraised are not those that the test seems 
to be getting at. Outstanding in this group are the so-called projective 
tests discussed in Chapter 15. What purports to be a test of “imagination”
may in fact be directed at revealing anxieties, tensions, and inner
emotional conflicts. Or a test of arithmetic computation may be 
rigged to yield a measure of cheating. But these are exceptions to 
the general rule that in a test the person knows that he is being tested
and knows what is being tested. 

TWO FORMS OF TESTS 

Within a defined test setting we may again recognize an important
distinction, which depends upon whether the examinee leaves a permanent
record of his behavior or whether it must be observed “on the
wing” as it takes place. The first situation is represented by any test,
such as one of reading comprehension, in which the examinee marks 
his answers on a paper. The marks are then permanently recorded
and can be scored at leisure. The second type of test would be en- 
countered in an appraisal of oral reading, for example, where errors 



are noted by the listener as they occur or the quality is judged by the
listener as the reader speaks.

In this comparison, again, the advantages with respect to reliability
and objectivity usually fall on the [...] sheet or a definite
permanent record the test [...] of the
sort that leaves a permanent [...].
But young children cannot [...] and many others are
handicapped in a test that requires [...]. Again, some types
of performances, such as speaking, [...]
to a usable permanent record. [...] sometimes we are
interested not merely [...] he does
it. If a child gets the right answer for 5 × 7, does he get it quickly
or slowly? Surely or with fumbling? [...] does
not show in the written answer [...] be observed if the
child answers the problem [...] “thinks out loud.”

There are test situations, [...] which we shall have to depend
upon observations of [...] it takes place rather than
upon scoring the [...]. [...] test situations [...] special
[...] to look for. [...]
that they accept as right [...]
right by other observers. [...] training
of examiners is usually [...].

EXTERNAL OBSERVERS VERSUS SELF-OBSERVATION

As we move out of a [...] of the individual's
[...] person’s behavior, [...]
or a member of his family. [...] provide two quite different
views of the individual, [...]

The outside view is filtered [...]
of a particular outsider. [...]



side of the youngster— the school side that is turned toward him. 
Furthermore, he sees it colored by his own prejudices and limitations. 
What he sees as “cooperation” may from another viewpoint appear
to be docility; what he considers “insubordination” may appear to 
another to be independence. 

The self-picture is limited by the reporter’s lack of self-understand- 
ing and unwillingness to reveal himself to the watching world. We 
do not know ourselves perfectly. Some of our limitations, our petty 
meannesses and evasions, our weak and sensitive spots, we cannot 
face and admit even to ourselves. Still other shortcomings we recog- 
nize but are unwilling to acknowledge to an outsider. 

Sometimes one set of limitations will seem more serious, sometimes 
the other. If a person is applying for a job he very much wants, we 
will probably feel that we can put more trust in the evaluations of 
outsiders than in his self-evaluation. He has too much at stake in
the impression he makes. On the other hand, if he has come to us 
for help and guidance, his own more intimate self-picture may provide 
a better basis for counseling with him than will the impressions of an
outsider. We shall need to become acquainted with evaluation in- 
struments of both types. 

PLANNED VERSUS RETROSPECTIVE OBSERVATION

When we rely upon observations, either by the subject himself or
by others observing him from outside, we may call for new observations
made specially for us, or we may fall back upon the informal
and undirected observations that have occurred in the past. Suppose
we are studying the individual’s tendency to become angry. We might
ask him to keep track of all the times he got mad during the following 
week, noting down the circumstances for each anger episode, i.e., 
when it occurred, what precipitated it, what he did, etc. This would 
be an example of planned self-observation. By contrast, a second 
possibility would be to give him a list of situations that tend to annoy
or irritate people. We might then ask him to look into himself and
judge how readily he had tended to get angry at people who push in 
front in line, at being called by the wrong name, at being called down 
for something he did, and so forth. The self-observations would now 
be retrospective. If an outsider — say, a teacher — were doing the job, 
he might be asked to note down times during a specified period when 
he saw the particular pupil push, hit, or talk sharply to another. Or 
he might be asked to think back over his contacts with the child and



rate him on a scale ranging from “exceptionally calm and even-tempered”
to “flares up and gets angry at the slightest provocation.”

Again, there are advantages and disadvantages to both the planned
and the retrospective type of observation. A major difficulty with 
systematic planned observations is that they are laborious and time- 
consuming. It takes a great deal of time and a high level of observer 
cooperation to get the necessary observations made. Partly because 
of this, the observations are likely to cover a limited time period and 
therefore to represent a rather meager sample of the individual’s be- 
havior. However, when observations are of actual current behavior, 
they tend to be more objective and less influenced by biases and the 
selective effects of memory than retrospective reports. The retro- 
spective observations called for in self-report inventories and in rating 
scales have been widely used because of their administrative simplicity 
and because they summarize concisely the whole history of self-observation
or contact with the person rated. But this type of summarizing
judgment gives the biases of the respondent the fullest chance
to express themselves. 

OBSERVATION AND TEST COMBINED: THE
SITUATIONAL TEST

As we noted earlier, some behavior in test situations leaves no
record behind but must be observed as it occurs. Here we have 
something of a hybrid involving both observation and test. The ob- 
server notes the specific errors a child makes when he reads aloud 
or his hesitations and false starts in spelling a word. Sometimes the
“test” may involve a much more complex and total situation and more 
subtle types of behavior. In many of these “tests,” the person being
observed may not realize what is being observed (or even that he is 
being observed). So, if we want to appraise the individual’s tendency 
to get angry, we may put him in a standard anger-producing situation. 
For example, we may give him a job to do and two intentionally stupid 
assistants who keep making mistakes and getting in the way. In so far 
as we are able to present each subject with the same task, we have 
a test situation. But we must depend upon the observations and judgments
of outsiders to evaluate his behavior.

These complexly structured lifelike situations, which strive for the 
uniformity of a test situation and yet for the naturalness of real-life 
events, may be called situational tests. They represent a compromise
between the objectivity and standardization of the testing approach 
and the naturalness of a real-life situation. This approach presents



interesting possibilities for getting at types of behavior that do not 
readily lend themselves to the conventional types of testing. 

The practical problems faced in devising situational tests are very
great. They call for elaborate staging if the naturalness of real life 
is to be preserved. In addition, the problems of obtaining satisfactory
observations and adequate reports of them remain. For these reasons, 
situational tests have not been widely used. But they present an interesting
type of tool, whose possibilities are only beginning to be
explored. 

FUNCTIONS FOR WHICH MEASUREMENT HAS 
BEEN UNDERTAKEN 

Broadly speaking, psychologists and educators have been interested 
in measuring in two general areas, what a person can do and what he 
will do. Measures of the first sort are measures of ability. In our 
discussion we will divide ability measures into measures of aptitude 
and measures of achievement. Again, roughly speaking, an aptitude 
test undertakes to measure what a person could learn to do, whereas
an achievement test measures what he has learned to do.

The distinction between aptitude tests and achievement tests is far 
from a clear one, because we often use what a person has learned as 
a cue to what he can learn. Thus, a measure of the amount of knowledge
of mechanical devices a person has gained in the past may be
one of the most accurate indicators of the amount of further knowledge
of things mechanical he will acquire in the future. The clearest
distinction between aptitude and achievement tests lies in the direc- 
tion of our interest. In an aptitude test, our interest is to predict 
what the individual can learn or develop into in the future; in the 
achievement test our interest is in what he has learned in the past.

Measures of the second major category — of what the person will
do — correspond to the area we may roughly label personality measurement.
This is a somewhat broad and loose definition of personality.
It is also a somewhat external one. That is, we have indicated
a concern for what a person does rather than for how he feels
or what his inner urges and conflicts are. We may be interested in
those to a degree. But, so far as a testing or observational procedure
is concerned, it is always based on what a person does — how he acts,
what answers he marks, or what he says. His actions are the basic 
material that we study. 

In the long run, his future actions are what we want to predict: 
whether he will graduate from college, whether he will continue in 



and apply himself to a clerical type of job, whether he will behave
in a more socially acceptable fashion after a particular type of therapy.
We may perhaps make these predictions more surely if we organize
the test and observational appraisals around certain concepts of interests,
needs, or conflicts. But these terms describing the inner life
of the individual represent inferences that we make as a way of structuring
and organizing the observations of the individual's behavior.
We cannot see a need for approval. What we observe is that a child
brings things into class, attempts to talk at all times, buys candy for
other children, and tries to join any social group in the playground.
We may infer a need for approval as an underlying factor related to
the various behaviors. 

When we try to measure what a person will do, as distinct from
what he can do, we encounter some special problems. These are
primarily problems of intentional distortion of the test results. In 
an ability test we want each individual to try hard and do the best
he can. But in personality measures, we do not want to know how 
cooperatively a person can behave or how energetic he can be. We
want to know to what degree he typically does show energy or behave
in a cooperative manner. In a limited test situation, where the nature 
of the test is clear to the examinee, everyone can put his best foot for- 
ward. He can probably muster up all the virtues for a special occa- 
sion. But will he in other situations? It is this question, the question 
of the degree to which behavior in an identifiable test situation will
represent behavior in real life, that pushes us into disguised tests and
into observational evaluations of personality characteristics. 

ASPECTS OF PERSONALITY

It will be convenient to use a number of terms to refer to certain 
fractions or aspects of personality that we may wish to evaluate. 
These terms and the meanings that attach to them are discussed 
briefly below. 

Character. Character traits are aspects of individual behavior to 
which a definite social value has been attached. Honesty, coopera- 
tiveness, thrift, kindliness, and loyalty are all labels for social virtues. 
Educational and religious organizations have always been concerned 
with the inculcation of such virtues. Based on this concern there have 
been developed a number of evaluation procedures that we shall refer
to as measures of character. 

Adjustment. Educators and psychologists have long been concerned
with the concept of adjustment. The mental hygiene approach
as applied both in and out of school has striven to develop “well-adjusted
personalities.” Maladjustment is recognized in individuals
who fail to fit into the social group or who appear to live unhappy 
and unproductive lives. As with character, degree of adjustment 
represents a social judgment, and what is conceived to be well-ad- 
justed behavior varies from one culture to another, depending upon 
what is normal for that culture. Normal behavior in our competitive, 
acquisitive society might seem pathological if transferred to a South 
Sea island. Adjustment will mean, then, behavior patterns that enable
the person to get along in and be comfortable in his social setting — typically,
the setting of middle-class, twentieth-century American-European
culture. We shall encounter a group of instruments
designed to evaluate deviations from this norm — the tendency to show 
maladjusted behavior or behavior typical of people who do not get 
on happily and successfully in our culture. 

Temperament. From early days observers of human nature have 
noted conspicuous differences in energy level, prevailing mood, and 
general style of life. Literary men and men of science alike have 
proposed systems for classifying temperaments. Hippocrates, for ex- 
ample, proposed that men could be divided into the sanguine (ener- 
getic and cheerful), choleric (energetic and irascible), phlegmatic 
(sluggish and placid), and melancholic (sluggish and sad), and pro- 
posed physiological bases for these distinctions. There have been 
many other classifications before and since. Appraisals of such di- 
mensions as these we shall speak of as measures of temperament. 

Interest. The individual makes a variety of choices with respect 
to the activities in which he engages. He shows preferences for some, 
aversion to others. Appraising these tendencies to seek or avoid par- 
ticular activities constitutes the domain of interest measurement. 

Attitude. The individual responds with enthusiasm and aversion 
not only toward activities but also to social groups, social institutions, 
and the other aspects of his world. These reactions, with their various 
ramifications, constitute the individual's constellation of attitudes. 
Various devices have been developed for evaluating these prejudices 
pro and con, and these constitute the field of attitude measurement. 


CONCLUDING STATEMENT 


In summary, then, approaches to the measurement of the individual
cover a great diversity both of methods and of content areas. Variations
of method may be represented by the following outline:


I. Test methods, involving a defined task and testing period

A. Permanent record or product available for scoring or analysis.

B. Process must be observed and evaluated as it occurs.

II. [...] methods, in which behavior is observed in the natural [...]

A. [...] in which the individual reports on his own reactions

1. [...] to cover a specified period.

2. Retrospective observation, based on present memory and
evaluation of past reactions.

B. [...] reports [...] reactions

1. Planned observations.

2. Retrospective observations.

[...] a test but [...]

[...] are elaborated in [...] evaluation procedures have [...]

[...] evidences of what [...]

[...] to show what [...]

C. Temperament

D. [...]

[...]

This analysis of [...]
detailed. However, it [...]
which we shall be concerned in [...] chapters.

QUESTIONS FOR DISCUSSION

1. It would [...] be agreed that personality measures are less
satisfactory than measures of aptitude or achievement. [...] give rise [...]

2. How would you [...] the classification [...]
measurement methods [...]



a. Anecdotal records kept by a teacher, describing behavior in his classroom.

b. An autobiography written by a pupil for a high-school counselor. 

c. An individual intelligence test in which both questions and answers 
are given orally. 

d. A Boy Scout's record of “good deeds,” kept over a 2-week period
and reported to his Scoutmaster. 

3. Illustrate, from your reading or experience, each of the categories of
measurement methods in the outline on pp. 24—25. 

4. How would you fit each of the following into the outline of aspects 
of the individual to be evaluated, given on p. 25? 

a. Observations of how well a high-school student gets along with adults.

b. A pupil’s expression of his preferences for books in an annotated list 
of titles. 

c. A kindergarten child’s performance on a test of readiness to learn 
reading. 

d. A pupil's performance on an English test, used to place him in the 
appropriate section. 

e. Ratings of a pupil on his loyalty to his friends. 

5. From your reading or personal experience, give an illustration of
measurement procedures for each of the aspects of the individual
identified in the outline on p. 25.

6. A class has just finished a unit on etiquette, and the teacher wishes
to evaluate the effectiveness of the unit. Which of the methods outlined
on pp. 24-25 might she use? What would be the advantages and limitations
of each?



Chapter 3 

The Teacher’s Own Tests


In this book dealing with educational and psychological measurement
procedures, we have elected to start with a consideration of the
teacher's own tests. We have done this for several reasons. In the
first place, informal test making is an operation that is familiar to
every teacher, and the outcomes of such test making are familiar to
every student. In the second place, because the teacher-made test is
so widely used and has such an important place in evaluating student
achievement, it strongly influences students' views toward tests and
test-taking specifically and toward education generally. In the third
place, the techniques of testing available to every teacher form the
backbone of standardized tests of achievement and of aptitude.
Furthermore, the quality of the items on a standardized test and the
adequacy of the coverage of a standardized test are judged by precisely
the same standards that apply to teacher-made tests.

THE ROLE OF TEACHER-MADE TESTS

Evaluation of pupil progress is a major aspect of the teacher’s job.
A good picture of where the pupil is and of how he is progressing
is fundamental to effective teaching by the teacher and to effective
learning by the pupil. The evaluation* procedures the teacher uses
with his group serve a number of functions. We will identify four,
commenting briefly upon each of them. All the procedures the teacher
develops for pupil evaluation may serve these functions, but we shall
be concerned to point out how they may be served by the more formal
evaluation instruments called tests.

* The term “evaluation” as we use it here is closely related to measurement.
It is in some respects more inclusive, including informal and intuitive judgments
of pupil progress. It also includes more definitely the aspect of valuing, of
saying what is desirable and good. Good measurement techniques provide the solid
foundation for sound evaluation, whether of a single pupil or of a total
curriculum.




MOTIVATION

To some degree, varying from pupil to pupil and from class to class,
tests determine when students study, what they study, and how they
study. Tests that are well constructed and effectively used can
motivate students to develop good study habits, to correct errors, and
to direct their activities toward the achievement of desired goals.
Tests that are poorly constructed or used punitively can just as
effectively discourage the students or misdirect their learning. Testing
procedures control the learning process to a greater degree, perhaps,
than any other teaching device.


DIAGNOSIS AND INSTRUCTION

Testing serves to diagnose weaknesses and to provide practice for
available knowledges and skills. The items on which an individual
fails or on which many members of a class group fail can serve to
identify points needing further study whenever the test task is
sufficiently precise for the nature of the failure to be identified. The
function of a test as a rehearsal of knowledge and a guide for further
study has long been recognized.


DEFINING TEACHING OBJECTIVES

What a teacher emphasizes in his evaluation of pupils, and particularly
in the more formal evaluation represented by tests, defines to his
students what that teacher considers important. This definition is
presented in a much more forceful way than any pretty speeches that
the teacher may make. The teacher may avow, to his students or to
his colleagues, that he considers the ability to apply facts to real
situations and to understand basic principles to be much more important
than just learning facts. But if his tests ask only for names, dates,
places, and sentences from the book or his lectures, those will be his
functional objectives, and those will be the things that his students
will study (the docile ones who are influenced by him anyhow). We
may know a teacher by the tests he makes. They tell what he is truly
valuing in his pupils, even though he himself does not know it, and
they influence profoundly what his students will learn.

DIFFERENTIATION AND CERTIFICATION OF PUPILS

The teacher inevitably has a responsibility for certifying pupils’
accomplishments to higher levels of the educational enterprise and
to the world outside the school. The testing procedures he uses help
him to arrive at the judgment that is recorded in his mark, letter of
recommendation, or other evidence of approbation or disapprobation.

In view of the many functions they serve and in view of the disservice
that may be done the pupil from poorly conceived or executed
evaluation instruments, it is important that the teacher’s evaluation
devices be well thought out and well made. To evaluate the range
of outcomes in which a modern school is interested (understanding
as well as knowledge, appreciation as well as skill, ability to apply as
well as to reproduce, attitudes and interests as well as achievements),
the teacher must call upon a variety of types of appraisal. He must
profit from observation of classroom performance by recitation, by
participation in informal discussions, by contribution to group
enterprises. He must size up the student in conference, interview, and
informal discussion. He may have occasion to rate the products
produced in laboratory or shop and to appraise the quality of assignments
carried out outside of school. He will also almost certainly make
some use of class tests. Some of the objectives of his teaching can
be measured efficiently, realistically, and completely by pencil-and-
paper tests. Some can be measured only partially by such means.
Some cannot be measured at all in this way. This chapter and the
next are concerned primarily with those objectives that can be
measured with tests and with the improvement of testing procedures to
measure them. Some consideration will be given to observational
procedures, ratings, and other types of appraisal devices in later
chapters.


PLANNING THE TEST 

The primary function of any evaluation procedure is to determine 
to what extent students have achieved the objectives of instruction. 
If a test is to serve this function effectively, it must be planned with 
that end in view. A test which “just growed” is unlikely to correspond 
very well to the teacher’s stated objectives. This is particularly true 
in the case of objective tests, and it is here that careful planning is 
especially important. However, one should not overlook the impor- 
tance of a good test plan even in the case of an essay test. 

If the teacher just sits down and writes objective test items, the
test is likely to be out of balance. It is easier to write simple factual
items than it is to write items that call for understanding of
generalizations or application of principles. It is easier to write items on
some topics than on others. As a result, the teacher is likely to end
up with an overload of items calling for simple information about
the more testable topics. The same thing is true, to a degree, of essay
tests. The outcomes measured by the test will then show a poor
correspondence with those espoused by the teacher. What the pupils
emphasize in their learning will soon follow what they find is
emphasized in their tests, and the tests will fail to foster the learnings
in which the instructor is most interested.

DEFINING OBJECTIVES

The thoughtful planning of a test involves several steps. The first
and most important step is to define the objectives that are to be
appraised. Before he can evaluate whether a student has achieved the
objectives of instruction, a teacher must be able to state what the
student was supposed to have achieved. Moreover, objectives that are
to be evaluated must be stated in terms of pupil behavior. We must
be able to specify the processes or activities that a student is expected
to display if he has achieved the objectives. What do we expect him to
know? What kinds of applications do we expect him to be able to
make? How do we expect him to think or to solve problems? What
actions on his part will show that he has acquired the attitudes that we
are trying to inculcate? In other words, what things must a student do
to show that he has acquired the knowledge, understandings, skills,
attitudes, and appreciations that we say we have been trying to teach?

The failure to define objectives in terms of student behavior
probably accounts for much of the inadequacy in evaluation of student
progress in schools and also for the very poor quality of many
classroom tests. Defining the objectives of instruction in terms of pupil
behavior is not an easy task, but it is necessary before a good test can
be constructed or effective evaluation can be done.

The real work of defining objectives must usually be done by the
teacher himself, perhaps assisted by his colleagues, and working from
his textbook or course outline. In many schools the teacher has
available a curriculum guide or course of study which does contain a set of
objectives. But objectives listed in these sources tend to be too vague
and global to be useful as a guide for evaluation. They need to be
broken down into more specific components if they are to provide a
sufficiently exact definition of just what the broad, global objectives
mean.

Let us look at an actual example. In the section below are listed
the objectives stated as the desired outcomes for an eighth-grade social-
studies unit on the functioning of our national government.



Objectives of a Unit on How Our National Government Functions

1. Has a basic foundation of facts and information necessary to an
understanding of the unit.

2. Understands why the Declaration of Independence was written.

3. Understands the ideas embodied in the Declaration of Independence.

4. Understands the Articles of Confederation.

5. Understands Articles I through VII of the Constitution.

6. Understands the Bill of Rights and other amendments to the
Constitution.

7. Can use and interpret maps.

8. Can locate and interpret data.

9. Can do critical thinking.

10. Derives personal satisfaction from social studies reading.

11. Is able to plan, execute, and evaluate committee projects.

12. Uses parliamentary procedures.

13. Develops a love for and loyalty to the principles of the government
of the United States.

14. Develops an abiding interest in civic affairs beyond the years of
formal education.

15. Has an appreciation for the principles of a democracy which formed
the basis of our government.

As stated, these objectives embody many of the faults typically 
found in the statements of objectives available to teachers in courses 
of study. Some of these may be pointed out and illustrated. 

1. The Objective Is One That Cannot Be Achieved, and Certainly
Cannot Be Evaluated within the Unit. Objective 14 refers to adult
life, and not to anything that characterizes the pupil in the eighth
grade. It is an expression of a pious and worthy hope, but of little
help in guiding the teacher as to what he should do with a pupil or
look for in him.

2. The Objective Is Expressed in Terms of Unobservables.
Objectives 13 and 15 state worthwhile hopes, but provide no guidance
to the teacher as to what the eighth grader is to do to show his love,
loyalty, and appreciation. How does a student exhibit appreciation
for the principles of democracy? Through giving lip service to the
words and symbols? Through accepting individuals who differ from
himself in various ways? Through participating in school
government? Behaviorally, what do these objectives mean?

3. The Objective Bears Little or No Relationship to the Content
of the Unit. Objective 7 is one that has little relationship to the
content being studied. There are certainly many better units in which
to build up map-reading skills. Such skills play very minor roles, if
any, in this particular content.




4. The Statement of the Objective Implies a Process Quite Different
from the One That Is Taught. Objectives 2 through 6 start with
the word “understand.” But what does the word really mean in this
context? Consider 2, for example. Further study of the curriculum
guide brings out that the pupil is to “understand” that the Declaration
of Independence was written to explain America's cause in the War
of Independence. However, this point is specifically and explicitly
made in the textbooks. A pupil could produce this statement on an
examination on the basis of direct recall of what he had learned. No
real understanding is called for. The same is true of objectives 3
through 6. These objectives should more appropriately begin with
“knows,” “can recall,” or “can state,” because these words more
accurately reflect the process of reproducing from memory.

The term “appreciation” in objective 15 is also often used quite 
loosely. It is often used to refer to information about something, 
rather than an affective or aesthetic reaction to it. Thus, again, it 
may be more realistic to talk about knowledge of the principles, and 
perhaps ability to interpret or apply them, than about “appreciation” 
of them. 

5. The Objective May Be Inappropriate to the Level of Instruction. 
To be realistic, we can expect eighth graders to achieve only a very 
limited understanding of this particular unit. The teacher should 
recognize this, and place “understanding” in a proper perspective both 
in the weight that is given to it and the level of sophistication that is 
expected. 

The teacher can both develop and test for the level of understanding
appropriate at a given grade level if he can identify the processes
on the part of the student that represent understanding. The student
can show understanding at different levels by (1) expressing concepts
and principles in his own words, (2) pointing out similarities and
differences not explicitly pointed out in class or text, (3) pointing out
relationships, and (4) applying his information to situations about which
he has not been taught.

We could point out other specific defects in the list of objectives
as they are stated, but let us stop and see how the list might be
revised so that the objectives would provide a better guide both for
teaching and evaluation. For this purpose we want to separate the
process from the content. A good guide for setting up objectives that
relate to process rather than content is given by Bloom et al. A
revised set of process objectives for this unit is given below. When the
word knows is used, it means the ability to reproduce, recall,
recognize, or define.




1. Knows terms and vocabulary.

2. Knows dates, events, persons, and places.

3. Knows generalizations, concepts, and principles.

4. Can trace the sequence of historical development (of our national
form of government).

5. Can express generalizations and concepts in his own words.

6. Can point out relationships, similarities, and differences.

7. Can apply generalizations and principles to particular concrete
situations that are new to him.

8. Can use parliamentary procedures in committee or class meetings.

9. Can plan, execute, and evaluate committee projects.

10. Shows support either orally or in writing for governmental actions
that follow the democratic principles on which the government is
based.

11. Expresses concern either orally or in writing for an individual or
group being deprived of the rights guaranteed by our constitution.

This set of objectives keeps the basic intent of the original set of
objectives. For example, objectives 1 through 7 in the revised list
include the intent of objectives 1 through 6, 8, 9, and 15 on the
original list of objectives. Objectives 8 and 9 in the revised list parallel
11 and 12 on the original list. Objectives 10 and 11 on the revised
list are redefinitions of 13 on the original list. Those objectives on
the original list that were totally unrealistic or did not apply to the
unit have been eliminated in the revised list.
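
The bookkeeping in the paragraph above can be checked mechanically. The following is only a minimal sketch: the variable names and the grouping labels are our own, and the objective numbers are simply those stated in the text.

```python
# Objective numbers on the original list (1 through 15).
original = set(range(1, 16))

# Original objectives carried into the revised list, as stated in the text.
# The dictionary keys are labels chosen for this illustration only.
carried = {
    "revised 1-7": {1, 2, 3, 4, 5, 6, 8, 9, 15},
    "revised 8-9": {11, 12},
    "revised 10-11": {13},
}

# Everything kept somewhere in the revised list.
kept = set().union(*carried.values())

# Whatever is left over was eliminated in the revision.
dropped = sorted(original - kept)
print(dropped)  # [7, 10, 14]
```

On this reading, the objectives eliminated are 7 (map use), 10 (satisfaction from reading), and 14 (interest in civic affairs in adult life).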

OUTLINING CONTENT

The second step in planning for a test is to outline the content to
be covered. The outline of content is important because the content
is the actual vehicle through which the process objectives are to be
achieved. An outline of content for our illustrative example follows.

Outline of Content of a Unit on How Our National Government Functions

I. The Foundations of Our Constitutional Government (Time
   allotment: 2 weeks)

   A. Early English documents: the Magna Charta, the Petition of
      Right, and the Bill of Rights
      1. Provisions of these documents
      2. Their influence on our heritage

   B. Mayflower Compact, Fundamental Orders of Connecticut, The
      New England Confederation, The Albany Plan
      1. Principles embodied in these documents
      2. Influences of these documents

   C. Representative Assemblies in the Colonies

   D. Declaration of Independence
      1. Events leading up to the drafting of the Declaration of
         Independence
      2. Drafting the document; ideas embodied in the declaration
         a. Principles of government the delegates believed in
         b. List of grievances against King George III
         c. A declaration of freedom
      3. Adoption of the Declaration of Independence on July 4, 1776

   E. The Articles of Confederation (1781-1789)
      1. Provisions
      2. Weaknesses
      3. Constitutional convention called by Congress in 1787

   F. The Constitutional Convention, 1787
      1. Members
      2. Stated purpose
      3. Need for strong national government
      4. Disagreements solved by debate and compromise
      5. Contributions of influential men at the convention
      6. Ratification of the Constitution

II. Principles and Development of Our Constitutional Government
    (Time allotment: 4 to 6 weeks)

   A. Purpose as stated in the preamble

   B. Federalism versus Confederation
      1. Division of powers between central government and states
      2. Advantage of federalism

   C. Unitary and Federal Government

   D. Division of power in our federal system
      1. Reasons for separation of powers
      2. Division of powers in federal government
      3. Division of powers among local units
      4. Limitations placed on federal government

   E. The Three Branches and the System of Checks and Balances
      1. The executive, legislative, and judicial branches
      2. Meaning of checks and balances

   F. The Articles of the Constitution
      1. Provisions of the articles

   G. The development of the Constitution
      1. Elastic clause
      2. Changing the Constitution
         a. Amendments: process
         b. Statutes
         c. Court decisions
         d. Customs and usage

   H. The Bill of Rights


PREPARING THE TEST BLUEPRINT

The content outline and a statement of process objectives represent
the two dimensions into which a test plan should be fitted. These two
dimensions need to be put together to give a complete framework and
to see which objectives relate especially to which segments of content.
In planning for the total evaluation of a unit the teacher would be well
advised to make a blueprint covering all objectives and add a column
at the extreme right indicating the method or methods to be used in
evaluating student progress toward achieving the objectives. However,
in making a blueprint for a test, only those objectives that can be
measured either wholly or in part by a paper-and-pencil test should be
included.

In the list of revised objectives, objectives 8 through 11 cannot be
measured by a paper-and-pencil test. For example, objective 8, “Can
use parliamentary procedures in committee or class meetings,” can be
evaluated only by putting the student in a somewhat formal meeting
and observing whether he follows parliamentary procedures. On a
paper-and-pencil test the teacher could determine whether a student
knows parliamentary procedure but not whether he uses it. Since a
student cannot use parliamentary procedure unless he knows it, such
testing of knowledge is sometimes worthwhile, but the teacher should
remember that in this unit the objective was stated in terms of using
rather than of knowing. Observation of performance in assigned class
groups and in informal school activities would seem promising
approaches to evaluating objectives 9 through 11. The teacher should
remember that no single test or evaluation medium can measure all
the objectives that he is trying to achieve.

Figure 3.1 shows the process objectives in our revised list that can
be measured by a paper-and-pencil test and their relation to the
content of the unit. The objectives are listed in the left-hand column of
the chart and the titles of the two major content areas are used to
head the two right-hand columns. Each cell is then filled with notes
suggesting terms, dates, events, generalizations, relationships, or
applications that a teacher might consider important specific examples of
that content and that process. For example, the upper left-hand cell
contains four terms (inalienable rights, tyranny, compromise, and
confederation) that this teacher considered it important for students
to know. The cell below that contains certain events, dates, and
documents that this teacher considered important. The other cells have
been filled in the same way in order to specify more precisely the
content to be included and what the student is supposed to be able to do
with that content.
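
For readers who keep such plans in electronic form, the two-dimensional layout just described can be sketched as a simple table keyed by objective and topic. This is only an illustration under our own assumptions: the variable names are hypothetical, and only the four terms named above are filled in, since the other cell entries are not reproduced here.

```python
# Rows: the seven process objectives measurable by a paper-and-pencil test.
objectives = [
    "1. Knows terms and vocabulary",
    "2. Knows dates, events, persons, and places",
    "3. Knows generalizations, concepts, and principles",
    "4. Can trace the sequence of historical development",
    "5. Can express generalizations and concepts in his own words",
    "6. Can point out relationships, similarities, and differences",
    "7. Can apply generalizations and principles to new situations",
]

# Columns: the two major content areas of the unit.
topics = [
    "I. The Foundations of Our Constitutional Government",
    "II. Principles and Development of Our Constitutional Government",
]

# One cell of notes for every combination of objective and topic.
blueprint = {(obj, topic): [] for obj in objectives for topic in topics}

# The upper left-hand cell, as described in the text.
blueprint[(objectives[0], topics[0])] = [
    "inalienable rights", "tyranny", "compromise", "confederation",
]

print(len(blueprint))                         # 14 cells
print(blueprint[(objectives[0], topics[0])])  # the four terms
```

Keeping each cell as a list of notes lets the teacher add or remove specific examples without disturbing the framework itself.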

The preparation of such a two-dimensional outline is undoubtedly
an exacting and time-consuming task. The busy classroom teacher
may often fall short of achieving such a complete analysis. There is no
question, however, that attempting the analysis will go far toward
clarifying the objectives of a particular unit and toward guiding not only
the preparation of a sound test but also the teaching of the unit itself.

Once the basic outline has been prepared, the test maker must
decide upon the relative emphasis to be given to the several content
areas and process objectives. The number of questions that can be
presented in a test is limited by the testing time available and by the
ability and background of the students to be tested. Since time does
not permit him to include everything, the test maker must select a
sample of questions. The sample should truly represent the emphasis given
in his teaching, both with respect to content and with respect to the
process objectives. This can be done by having the proportion of
questions in each content area correspond to the proportionate emphasis
given to that topic, and the proportion of items calling for each process
correspond to the importance the teacher considers that process to
have in the learnings that the pupils are to achieve. The decisions
made by the teacher in dividing up the questions on a test are
necessarily subjective ones. The basic principle underlying these decisions
is that the test should maintain the same balance in relative emphasis
on both content and mental processes that the teacher has been trying
to achieve through his instruction. This allocation of differing
numbers of items to different topics and process objectives is one way of
weighting these topics and objectives differentially in the test.

In the illustration of Figure 3.1, the test maker decided that about
30 per cent of the items should be on Topic I and 70 per cent on
Topic II. This corresponds roughly to the allocation of teaching time
to the two topics given on pp. 33-34, i.e., 2 weeks to Topic I and 4 to
6 weeks to Topic II. Time spent on the topic and space given to it
in the textbook can help to guide the teacher's judgment of the basic
importance of the topic and weight to be given it.
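
The rough correspondence between teaching time and the 30 per cent weight can be verified with a little arithmetic. The sketch below is illustrative only; the function name is ours, and taking 5 weeks as the midpoint of the 4-to-6-week range is an assumption made for the example.

```python
def time_share(weeks_topic, weeks_other):
    """Return the per cent of total teaching time spent on one topic."""
    return 100 * weeks_topic / (weeks_topic + weeks_other)

# Topic I took 2 weeks; Topic II took 4 to 6 weeks.
low = time_share(2, 6)   # Topic II at its longest
high = time_share(2, 4)  # Topic II at its shortest
mid = time_share(2, 5)   # assumed midpoint of 5 weeks for Topic II

print(round(low, 1), round(mid, 1), round(high, 1))  # 25.0 28.6 33.3
```

Topic I thus took between 25 and 33 per cent of the teaching time, so a weight of about 30 per cent is a reasonable rounding and not an exact derivation.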

The test maker in Fig. 3.1 allocated 35 per cent of the total
number of items to objective 3, 25 per cent to objective 4, 10 per cent
each to objectives 1, 5, 6, and 7, and 5 per cent to objective 2. These
reflect a judgment by this test maker that objective 3, knowledge of
concepts, generalizations and principles, is clearly the most important
objective of the unit, that objective 4, ability to trace sequences of
historical events, is next most important, and that the others are of
about equal, but lesser importance. Names and dates (objective 2)
are relegated to a minor role. Note that here, as in the topical outline,
100 per cent has been distributed among the different categories.

The test maker must also decide now whether he will use essay test
questions or short-answer, objective items, and if he decides on
objective items he must decide which type or types he will use. The
choice is governed, at least in part, by the objectives to be measured.
The next section of this chapter will provide some discussion of the
advantages and disadvantages of essay questions, in relation to the
objective type of item. Different types of objective items will be
described and compared in Chapter 4.

At about this point the total number of essay questions or objective
test items must be decided upon. This is primarily a function of the
time available for the test and the type of items being used.
Different types of objective items differ in the time allotments they require,
and, of course, an essay question demands a great deal more time than
an objective item. It is almost impossible to state in general terms
how much time should be allowed per item for objective items of a
specific type. The appropriate time allowance is affected by a host
of different factors. Among the most important are (1) the age of
the pupils being tested, (2) the length and complexity of the item,
(3) the type of objective being tested (knowledge of fact or concept
versus application to new situation), (4) the amount of computation,
if any, required by the item, and (5) the relative interest of the
examiner in speed versus power, i.e., the amount the pupil can do with
unlimited time.

In general, it seems undesirable to emphasize speed in an
achievement test designed to measure one's range of information or ability
to apply knowledge. Most teacher-made tests should be power tests;
i.e., there should be enough time so that at least 80 per cent of the
students can attempt to answer each item. As a teacher becomes
familiar with the kinds of students he usually has in a class, he will be
able to judge the number of items he can include in a given amount
of testing time while still having a power test. As a rough rule of
thumb, the typical student might require from 30 to 45 seconds to
read and attempt to answer a simple factual type multiple-choice or
true-false item, and from 75 to 100 seconds to read and attempt a fairly
complex item requiring problem-solving or some computation. If the
test items are based on a reading passage, tabular material, map, or
graph, time must be allowed for reading and examining the material.
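
The rule of thumb above lends itself to a quick estimate of testing time. The following sketch is only an illustration: the function name and the mix of 40 simple and 20 complex items are hypothetical, and no allowance is made for reading passages, maps, or graphs.

```python
def time_range_minutes(n_simple, n_complex):
    """Rough (low, high) testing time in minutes from the rule of thumb.

    Simple factual items: 30 to 45 seconds each.
    Complex items: 75 to 100 seconds each.
    """
    low_seconds = n_simple * 30 + n_complex * 75
    high_seconds = n_simple * 45 + n_complex * 100
    return low_seconds / 60, high_seconds / 60

# Hypothetical mix: 40 simple factual items and 20 complex items.
low, high = time_range_minutes(40, 20)
print(low, round(high, 1))  # 45.0 63.3
```

A teacher with a 50-minute period would see from such an estimate that this mix sits near the upper limit of what a power test allows, and might trim a few of the more complex items.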

Adequately to sample achievement in a large segment of work, i.e.,
the content of a whole semester, may require more items than can
reasonably be included in a single-period test. The only satisfactory
solution to this problem is to allow two or more periods for testing.
If a single unit of sufficient length is unavailable or seems likely to go
beyond the attention span of the group, the natural solution is to
break the test up into two or more subtests that can be given on
successive days.




Once the teacher has decided upon the total number of items to
be included in the test, he should go back to the blueprint and
determine how many items are needed for each cell. In the sample
blueprint, Fig. 3.1, the total time available for testing was 50 minutes.
The test maker decided to have a total of 60 questions on the test.
Applying the percentages in the blueprint, approximately 18 questions
should be on Topic I (30 per cent) and 42 questions should be on
Topic II (70 per cent). The 18 questions on Topic I are distributed
in the cells of that column according to the weights assigned to the
objectives. To obtain the number of items for each cell, one
multiplies the number of items for Topic I (18 items) by the percentage
assigned to the objective in each row. For example, to determine the
number of items for the first cell in Topic I, we multiply 18 by 0.1
(10 per cent) which gives 1.8 items. Since this product is between
1 and 2, we can note that we should have either one or two items
covering this content and this objective. The other cells in the
blueprint are filled in by the same process. It is probably desirable to
indicate a range of frequency for each cell, as was done in our
example, in order to provide flexibility if difficulty is encountered in
writing acceptable items for certain cells. The frequencies are to be
thought of as a guide and not as a strait jacket.

After all the items for the test have been constructed, the teacher
should make a final check by sorting the items in piles to match the
blueprint in order to make sure that the two agree.
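
The arithmetic for filling a cell can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name is ours, and using the whole numbers just below and just above the exact product as the range is one plausible reading of "either one or two items."

```python
import math

def allocate(total_items, topic_pct, objective_pct):
    """Return (exact, low, high) item counts for one blueprint cell."""
    topic_items = total_items * topic_pct / 100
    exact = topic_items * objective_pct / 100
    return exact, math.floor(exact), math.ceil(exact)

total = 60  # items on the whole test, as in the example

# Items per topic: 30 per cent to Topic I, 70 per cent to Topic II.
topic_items = {name: round(total * pct / 100)
               for name, pct in {"I": 30, "II": 70}.items()}
print(topic_items)  # {'I': 18, 'II': 42}

# First cell of Topic I: objective 1 carries 10 per cent.
exact, low, high = allocate(total, 30, 10)
print(round(exact, 1), low, high)  # 1.8 1 2
```

The same call, with the appropriate percentages, gives the suggested range for any other cell.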

The above discussion of allocation of items to topics and objectives 
applies primarily to objective tests made up of a large number of 
items. The same degree of analysis hardly applies to an essay exam- 
ination, which will at best be composed of a relatively small number 
of items. But these few items should also be distributed over the 
content and the process objectives so that the test represents as well 
as possible the explicit goals of instruction. 

A final decision that comes in as part of the preliminary planning
concerns the desired difficulty of the test items. The decision depends
in part upon the purpose of the test. When the test is to measure
mastery of the basic essentials in an area, the questions should be
limited to basic essentials. If the unit has been well taught, all the
items may then turn out to be very easy for the group. When the
purpose of the test is to discriminate levels of achievement of
different members of a group, i.e., to serve as a basis for ranking or grading,
some items should be very easy, most of them should be of moderate
difficulty, and a few should be difficult enough to spread out the ablest
members of the group. Difficulty, in this context, is defined in terms
of the percentage of examinees who get the item right.
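
The definition of difficulty just given translates directly into a computation. The sketch below uses invented response data for ten examinees, and the function name is our own; it is meant only to show the calculation.

```python
def item_difficulty(responses):
    """Per cent of examinees who answered the item correctly."""
    return 100 * sum(responses) / len(responses)

# Hypothetical data: 1 = right, 0 = wrong, for ten examinees on three items.
items = {
    "easy":     [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "moderate": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "hard":     [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
}

difficulty = {name: item_difficulty(r) for name, r in items.items()}
print(difficulty)  # {'easy': 90.0, 'moderate': 60.0, 'hard': 20.0}
```

Note that under this definition a higher percentage indicates an easier item.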

Our test plan now consists of: 

1. An outline of content and objectives. 

2. Specific suggestions of what might be covered under each com- 
bination of content and objective. 

3. An allocation of per cents of the total test by content area and 
by objective and an estimate of the total number of items. 

4. Specifications for the spread of item difficulties. 
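
As a rough sketch of how item 3 of this plan might be turned into numbers, the following Python fragment spreads a total number of items over the cells of a blueprint. The content areas, objectives, per cents, and the total of 60 items are all invented for illustration, and splitting each cell in simple proportion to its row and column per cents is our assumption; a teacher would ordinarily adjust the resulting figures by judgment.

```python
# Hypothetical blueprint weights (per cent of the total test).
content_pct = {
    "Articles of Confederation": 30,
    "Constitutional Convention": 30,
    "Bill of Rights": 40,
}
objective_pct = {
    "Knows facts": 50,
    "Applies principles": 50,
}
total_items = 60  # assumed estimate of test length


def items_per_cell(content, objectives, total):
    """Allocate items to each content-by-objective cell in proportion
    to the product of the two per cents (an assumed starting rule)."""
    plan = {}
    for area, c in content.items():
        for objective, o in objectives.items():
            plan[(area, objective)] = round(total * c / 100 * o / 100)
    return plan


plan = items_per_cell(content_pct, objective_pct, total_items)
for (area, objective), n in plan.items():
    print(f"{area:28s} {objective:20s} {n:3d}")
```

With other weights the rounded cell counts need not add up exactly to the intended total, which is one more reason to treat them as a guide rather than a strait jacket.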

The next task is to prepare the actual test items. In the remain- 
der of this chapter we will discuss the choice of item types and guides 
for improving the writing of essay questions. In Chapter 4 we will 
discuss guides for improving objective-type items. 

ADVANTAGES AND LIMITATIONS OF ESSAY
AND OBJECTIVE TESTS 

Teacher-made tests may be divided into two broad categories, essay
or free-answer tests and objective tests. One hears many arguments
about whether essay tests or objective tests should be used in schools,
but these “either-or” arguments are pointless. Neither the essay test
nor the objective test is satisfactory as the sole type of test to measure
academic achievement. Each type has its own advantages and
limitations and each has its place. The problem is to use each type of test
in those situations where its advantages are maximized and its
weaknesses minimized.

THE ESSAY TEST

The essay test consists of such problems as: 

Compare the organization and powers of the central government
under the Articles of Confederation with the organization and
powers of the central government under the Constitution.

Why did the merchants and business men particularly desire to
have the Articles of Confederation changed?

The Fifth Amendment to the United States Constitution states that
no person shall be deprived of life, liberty, or property without due
process of law. In your own words, explain what the underlined
part of the statement means.

Why is the Magna Charta considered to be an important milestone
in the establishment of a democratic government?



42 THE TEACHER'S OWN TESTS

The essential characteristics of the task set by an essay test are that
each student

1. Organizes his own answers, with a minimum of constraint.

2. Uses his own words (usually his own handwriting).

3. Answers a small number of questions.

4. Produces answers having all degrees of completeness and accuracy.

In these characteristics lie both the strengths and weaknesses of the 
essay examination. Let us consider each in turn. 

The Student Organizes His Own Answers. Herein lies the distinc- 
tive advantage of the essay examination. It requires the student to 
produce, rather than merely to recognize, the answer. Thus, it
minimizes the possibility of getting the answer by blind guessing or by
using little cues to outguess the test maker. It can, if the questions 
are well prepared, bring out the examinee's ability to select important 
facts or ideas, relate them to one another, and organize them into a 
coherent whole. Emphasizing this integrative type of product, it 
elicits, so it is claimed, better study habits in those who are preparing 
for it. 

The Answer Is in the Student's Words and Handwriting. At this
point a premium is placed upon verbal fluency and skill of expression.
The student who is able to write effectively will often get a higher
grade than another student who clothes the same ideas in less
attractive garb. Too often verbal fluency and aggressive salesmanship,
bluffing, in short, pass for knowledge of the subject. In addition to
skill in writing, quality of handwriting frequently influences the grade
on an essay test. How often has a student been penalized because the
instructor became irritated by poor handwriting, or could not be
bothered to decipher obscure “hen tracks”? Effective written
expression and good penmanship may be legitimate objectives of the
educational enterprise, but they should be evaluated in their own right.
They should not be allowed to contaminate our appraisal of a
student’s understanding of the causes of Hitler’s rise to power or of
Newton’s laws of motion.

The Test Is Limited to a Small Number of Questions. When the
individual must organize and compose an answer of some length, as
with questions like those on p. 41, the number of questions is inevitably
limited. The time required to answer a single question makes it
impossible to include more than five or ten questions in even a fairly
lengthy test. This tends to result in what we might call a “lumpy”



ADVANTAGES AND LIMITATIONS OF ESSAY AND OBJECTIVE TESTS 43 

sampling of what the student knows. We sink four or five big shafts
into the mine of knowledge that the student possesses. If these happen
to hit pay dirt, the student does well; but if they hit the gaps in his
knowledge, he does poorly. With this small number of samples
chance is likely to play a relatively large part. We may get a very
unfair sample of a particular student’s knowledge.

Of course, it is possible to ask free-response questions that call for
quite short answers. We might ask: What qualifications does the
Constitution set for United States Senators? This question requires
only a list of qualifications or a sentence or two for a complete
answer. Questions such as this are transitional between the essay and
objective test. They can be numerous and can sample many items
of knowledge or understanding. However, they sacrifice the main
feature of the essay question: the requirement that the examinee put
together an organized answer in which he relates, evaluates, and
integrates a number of facts and ideas.


Answers Are of All Degrees of Correctness. The bugaboo of the
essay examination is the laborious and subjective operation of
evaluating the answers. That it is laborious any teacher who has ever graded
a set of essay papers for even a middle-sized class can testify. That
the grading is subjective and relatively undependable has been shown
by a number of separate studies.

Consider the following answers written by two eighth-grade
students to the question “Compare the powers and organization of the
central government under the Articles of Confederation with the
powers and organization of our own central government today.”


Student A 

Our government today has a president, a house of representatives,
and a senate. Each state has two senators but the number of
representatives is different for each state. This is because of compromise
at the Constitutional Convention. The Articles of Confederation had
only a Congress and each state had delegates in it and had one vote.
This Congress couldn't do much of anything because all the states had
to say it was alright. Back then Congress couldn't make people obey
the law and there wasn’t no supreme court to make people obey the
law. The Articles of Confederation let Congress declare war, make
treaties, and borrow money and Congress can do these things today.
But Congress then really didn't have any power. It had to ask the
states for everything. Today Congress can tell the states what to do
and tax people to raise money they dont have to ask the states to
give them money. Once each state could print its own money if it
wanted to but today only the U. S. Mint can make money.



44 THE TEACHER'S OWN TESTS


Student B 

There is a very unique difference between the Central Government
under the Articles of Confederation and the National Government of
today. The Confederation could not tax directly where as the
National Government can. The government of today has three
different bodies -- Legislative, Judicial, and Executive branches. The
Confederation had only one branch which had limited powers. The
confederate government could not tax the states directly or an individual
either. The government of today, however, has the power to tax
anyone directly and if they don’t respond, the government has the
right to put this person in jail until they are willing to pay the taxes.

The confederation government was not run nearly as efficiently as
the government of today. While they could pass laws (providing
most of the states voted with them) the confederate government could
not enforce these laws, (something which the present day can and
does do) they could only hope and urge the states to enforce the laws.

These two answers together with three other answers written by
students in the same class were given to two groups of graduate students in
courses in measurement or evaluation. Both groups of students were
provided with a model answer to the question and given the following
instructions:

Instructions: The essay question was a part of a social studies test
consisting of fifty objective items and one essay question. The
students were given 25 minutes to write their answers to the essay
question. You have been given the answers written by five of the
students. The class that these five students were in was a
heterogeneous one. Twenty-five points is the maximum score for the
question. Please grade each paper using the model answer provided. The
grade is to reflect completeness and accuracy of the answer, not
quality of English expression, spelling, or grammar.

Suppose that you grade these two answers in accordance with the
instructions given above before you read any further. Record the scores
that you would give the answers.


Now look at Table 3.1, which shows the scores actually given to
all five answers, including these two. Every one of the answers
receives scores spreading over about 20 points of the possible range of
25. Any one of the papers might have gotten a score as high as 18;
any one might have gotten a score as low as 5. The responses of
students A and B were judged to be outstandingly good by some
raters, poor by others. The inconsistency of the judgments is
demonstrated most forcefully. A single rating of any one of these papers



ADVANTAGES AND LIMITATIONS OF ESSAY AND OBJECTIVE TESTS 45

Table 3.1. Grades Given to Five Answers to Essay Question

Score   Student A   Student B   Student C   Student D   Student E

[Table body: number of raters assigning each score, from 25 down to 0, to each student's answer. The individual entries are not recoverable from this copy.]



tells us very little about how that same paper will be rated by someone
else. Why is this? What makes the appraisal of an essay response
so undependable?

Let us admit to start with that the dice were somewhat loaded
against the graders in this little experiment. Most of them were not
social studies teachers, though the majority had had some teaching
experience. (Previous experience has indicated that social studies
teachers will show about as much variation.) Furthermore, they had
not taught the class, and did not know anything about the general
level of performance in this and similar groups.

One major reason for the wide range of scores found in Table 3.1
is that different raters maintained very different standards for rating
all the papers. Different raters used quite different parts of the scale
of scores. Though it was most common for a rater to spread his



46 


THE TEACHER'S OWN TESTS 

scores between about 5 and 20, a few awarded no grade higher than
10 to any of the answers while others assigned no grades below 15.
These last two groups were operating in entirely different score ranges
and showed no overlap. The best for one group was lower than the
poorest for the other. Judges differed not only in the average level
at which they rated the papers, but also in how much they spread out
their scores. Some were very “conservative,” bunching all their
ratings close together, while others tended to spread them widely over
the whole range. Such differences in grading standards are very real
in actual school situations (as every student knows) and provide one
main source for inconsistency in grading essay responses.
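
The two kinds of difference just described, in average level and in spread, can be made concrete with a small Python sketch. The three sets of ratings below are invented to mimic the patterns mentioned in the text; they are not the ratings from the experiment, and the rater labels are ours.

```python
from statistics import mean, pstdev

# Hypothetical scores (0-25 scale) given by three raters to the same
# five papers. Invented for illustration only.
ratings = {
    "severe rater": [4, 6, 7, 9, 10],
    "lenient rater": [15, 18, 20, 22, 24],
    "conservative rater": [11, 12, 12, 13, 13],
}

# For each rater: average level and spread (standard deviation) of scores.
summary = {rater: (mean(scores), pstdev(scores))
           for rater, scores in ratings.items()}

for rater, (level, spread) in summary.items():
    print(f"{rater:20s} level = {level:5.1f}   spread = {spread:4.1f}")
```

Here the best paper for the severe rater still scores below the poorest paper for the lenient rater, and the conservative rater's scores are bunched within a point or two of one another.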

However, the judges were also not very consistent in the rank order 
in which they arranged the 5 papers. In Table 3.2 we have shown 


Table 3.2. Rank Order Assigned to Each of Five Essay Questions

Rank   Student A   Student B   Student C   Student D   Student E
1         44          29           2          33           1
1.5       13          12           1          11
2         28          23           8          31           1
2.5       12          10           5          17
3         24          32          19          23
3.5        1           6           9           5           3
4          3          16          55           9          11
4.5        3           1          18           1          20
5          1           1          13                      94



how often each paper was ranked first, how often second, and so on.
(Tie ranks have been indicated as 1.5, 2.5, etc.) In this table we see
that every one of the 5 answers was ranked first by somebody, and
every answer was either last or tied for last. There is some consensus
that student E wrote the poorest answer and student C the next
poorest, but practically no agreement as to the relative standing of the
other three. Students A, B, or D could easily have been judged best
of the group or only average. Thus, there is not only a marked
difference in absolute standard from judge to judge, but also
inconsistency in the relative judgment of one paper in comparison with the
others.
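
The tie ranks of 1.5, 2.5, and so on arise when a rater gives two papers the same score and the papers share the average of the ranks they jointly occupy. A minimal Python sketch of that convention follows; the function name `average_ranks` and the sample scores are invented for illustration.

```python
def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest. Tied scores share
    the mean of the rank positions they occupy (e.g. 1.5 for a tie
    between first and second place)."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for s in scores:
        first = ordered.index(s) + 1   # best position held by this score
        count = ordered.count(s)       # how many papers share the score
        ranks.append(first + (count - 1) / 2)
    return ranks


# Hypothetical scores from one rater for papers A, B, C, D, E.
one_rater = [18, 18, 12, 15, 7]
ranks = average_ranks(one_rater)
print(dict(zip("ABCDE", ranks)))
```

For this invented rater, papers A and B tie for first and each receives rank 1.5; tallying such ranks over all raters gives a table of the kind shown in Table 3.2.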


Inconsistency in relative judgment is characteristic not only of
different raters but also of the same rater at different times. Thus, when
the evaluation class was asked to grade the papers a second time 3
weeks later (without advance notice that this was to be done), a

ADVANTAGES AND LIMITATIONS OF ESSAY AND OBJECTIVE TESTS 47
third of the ratings differed from the original rating by 5 points or
more.

. . . with differing responses is generally highly subjective
and undependable. We shall consider later in the chapter
what can be done to deal with these very real problems.

THE OBJECTIVE TEST

The objective test . . . in common the characteristic that the correct
answer, usually only one, is determined . . . The choice of content and
coverage of an objective test is probably as subjective as the choice of
content and coverage of an essay test, and for . . . subjective judgment
involved in the original . . . objective test items are shown below.

True-False

. . . that United States Senators shall be elected for terms of 4 years.

Multiple-Choice

A law passed by a legislature to punish a person without a court trial . . .

A. an ex post facto law.

B. a bill of attainder.

C. a writ of habeas corpus.

D. a warrant.

Completion

The right to vote is called . . . (suffrage).

Matching

Column I: Documents                 Column II: Dates

. . .                               . . .
The Mayflower . . .                 . . .
The . . .                           . . .
The Magna Charta                    D. 1689
                                    E. 1776



48 THE TEACHER'S OWN TESTS 

The essential features of a test made of objective items, as distinct
from an essay test, are that the examinee

1. Operates within an almost completely structured task.

2. Selects one of a limited number of alternatives.

3. Responds to each of a large sample of items.

4. Receives a score for each answer according to a predetermined key.

Again, let us examine these characteristics to see the advantages and
disadvantages of each. In large measure, they are the reverse of those
discussed for essay examinations.

The Task Is Completely Structured. The examinee does not have
a chance to organize and define the problem for himself. On the debit
side, this means that a test of this sort is not useful for appraising skills
of organizing and structuring ideas. On the credit side, we are
sure that each examinee is presented with the same problem.
“Discuss the Articles of Confederation” can carry quite different
meanings to different pupils.

The Examinee Selects from Among Given Alternatives. In most
types of objective item, the possible alternatives are completely
specified. (This is not the case with the completion type of item, and in
that respect it is on the boundary line, approaching the short
free-response type of question.) Where the alternatives are all provided,
the student is only required to recognize the right answer, not to
produce it by his own efforts. This has been criticized as representing
a lower level of intellectual process, and one that is less true to life.
How valid this criticism is probably depends upon how skillfully the
objective items are written, and how much they manage to get away
from the words of the text and simple memory of factual materials.
When an objective test item presents a new problem that must be
solved by recalling and applying facts or principles previously learned,
this type of item can require just as active recall as any essay question.

Another outcome of the limited set of answer choices is that an
examinee can be expected to get some answers right by guessing. This
becomes a problem particularly for true-false questions in which there
are only two choices. Tossing a coin would give 50 per cent right
on the average, and people would get different scores to some extent
because they were lucky or unlucky coin tossers. The problem of
guessing is serious in a short test with few answer choices, but chance
successes tend to even up in the long run if there are enough items,
if enough time is given for everyone to complete the test, and if
instructions about guessing can be made sufficiently definite so that all
examinees will adopt the same policy.
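
How far luck can carry a pure guesser, and how quickly it evens out as items are added, can be estimated with a short Python sketch. It rests on an assumption that goes beyond the text: that every item is answered by an independent blind guess, so that the number right follows the binomial distribution. The function name `guessing_spread` is ours.

```python
from math import sqrt


def guessing_spread(n_items, n_choices=2):
    """For blind, independent guessing, return the expected per cent
    right and the standard deviation of the per cent right."""
    p = 1 / n_choices                       # chance of guessing one item
    expected_pct = 100 * p
    sd_pct = 100 * sqrt(p * (1 - p) / n_items)
    return expected_pct, sd_pct


short_test = guessing_spread(10)    # 10 true-false items
long_test = guessing_spread(100)    # 100 true-false items
print(f"10 items:  {short_test[0]:.0f} per cent expected, "
      f"give or take {short_test[1]:.1f}")
print(f"100 items: {long_test[0]:.0f} per cent expected, "
      f"give or take {long_test[1]:.1f}")
```

Under these assumptions the coin tosser averages 50 per cent on either test, but the typical swing around that figure is roughly three times as large on the 10-item test as on the 100-item test.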



ADVANTAGES AND LIMITATIONS OF ESSAY AND OBJECTIVE TESTS 49

The Sample of Items Is Large. Since each item is brief, many
items can be included. These can be spread more evenly over the
topics to be covered and a more representative sampling can be
obtained. This reduces the role of luck, of the individual just happening
to have reviewed a particular topic. As a consequence of the
inclusion of many separate items, the score from a well-made objective test
is likely to be more accurate than that from an essay test, so that two
separate tests of an individual based on the same content areas will
rank him in more nearly the same place in his group.

Each Item Has a Predetermined Key. The key is established once
and for all by the test maker at the time the test items are written.
This means that scoring the test is a routine clerical task and can be
done by a person who knows nothing about the subject matter of the
test or even by one of the electrical test-scoring machines on the
market. The saving in time to score the test is very substantial but it
must be remembered that much of that saving will have been used up
in preparing the test. Writing clear and unambiguous objective test
items is a fairly demanding literary task.

The economy in time is less important than the uniformity in
evaluating answers that results. The score will be the same whoever
scores the test, once the key has been agreed upon. The score will
be the same no matter who it was that chose the answers. Teacher’s
pet or hellion, Spencerian specialist or scribbler, if they choose the
same answer they get the same score.


SUMMARY COMPARISON 


The issues we have been discussing are summarized in tabular form
below. In each case a plus sign is placed in the column of the test
pattern that would be judged superior with respect to that factor.


Factor                                                       Essay   Objective

Provides opportunity to test student's ability to select,
  organize, and integrate                                      +
Requires student to produce answer and not just
  recognize it                                                 +
Is free from factors of skill in expression and penmanship               +
Is free from opportunities for bluffing                                  +
Is free from opportunities for guessing                        +
Provides an adequately representative sample of the topics
  covered                                                                +
Can be prepared quickly                                        +
Can be scored quickly                                                    +
Can be scored routinely by a clerk                                       +
Can be scored with high consistency from scorer to scorer                +



50 


THE TEACHER'S OWN TESTS 


The balance of importance between these factors will vary from
situation to situation. It is clear that neither type has exclusive claim
to all the advantages. In evaluating the work of his class, the teacher
needs to use both kinds of testing procedures.


EFFECTIVE USE OF THE ESSAY EXAMINATION 

Because of their advantages in evaluating abilities to organize an
answer to a question, recall and select relevant information, and
present it logically and effectively, essay examinations should continue
to be used in the evaluation of student performance. If they are to
be used, the teacher should have some guiding principles as to when
to use them and what he can do to overcome their common
weaknesses. These weaknesses are found partly in the format of the
questions and partly in the process of evaluating the answers produced by
the students.

WHEN TO USE ESSAY EXAMINATIONS 

The factors that make it appropriate to use an essay examination 
are in part very immediate practical ones, in part more fundamental 
theoretical considerations. 

Immediate Practical Considerations. The most obvious practical
reason for using an essay examination is to save time. It takes a
number of hours to prepare a good objective test. When the class
group is small, there will be few papers to read and an essay
examination may actually save time. Moreover, when time to prepare an
examination is limited the teacher can substitute reading time after
the examination for preparation time before the examination. Since
many fewer essay questions than objective items are required for a
given amount of testing time, the teacher may find it easier to
construct a good essay test than a good objective test. However, it should
be emphasized that making good tests of any kind requires
considerable thought and effort on the part of the person writing the questions.
A teacher cannot expect to produce good tests of any kind if he dashes
off the questions a half-hour before the test is to be given.

A consideration that may be compelling in some cases is lack of
reproduction facilities for running off copies of the test. Then a set
of essay questions written on the blackboard is a practical solution.
In such a situation it would probably be wise to use some short
free-answer questions requiring only a few words or sentences for an
answer as well as those in true essay form requiring extended answers.
This will permit a wider and more adequate sample of the students’
achievement. Another practical solution is to read objective questions


EFFECTIVE USE OF THE ESSAY EXAMINATION 51 

to the class. This procedure will work for rather alert students and
for fairly simple items but it tends to be inefficient in requiring
repetitions of the items and it requires a somewhat special ability to
remember the total item well enough to indicate an answer. With more
complex items, the teacher will find that reading the items to the class
becomes less satisfactory.

A third point that is sometimes made is that essay questions are less
demanding upon the skill of the teacher. It is probably true that
ambiguities and poor expression are more apparent in an objective
item, but confusion as to what is wanted in response to an essay
question can also be substantial. In many educational settings, the
student must know the person who wrote the essay question in order to
write an acceptable answer. Many of the faults in writing objective
items can be avoided once they have been pointed out, so that it
seems more desirable to improve item-writing skills than to resort to
essay questions as a defense.

More Basic Theoretical Issues. The functions that can be
appraised better by an essay question than by short-answer or objective
questions are abilities to select, relate, and organize, to create
essentially new patterns and to use language to express one's ideas. For
example, objective 5 on our blueprint on p. 33, “can express
generalizations and concepts in his own words,” can be measured only
by an essay item since an objective item does not permit the student
to use his own words. There would be little justification for using
essay items to evaluate objectives one through three on our blueprint
since these objectives require the reproduction of factual information.
The essay question is an inefficient way to measure factual
information that could be more effectively and efficiently measured by a series
of objective items.

Merely phrasing a question in the essay form does not automatically
insure that the abilities to select and organize, to create new syntheses,
to make new applications, or the other so-called higher mental abilities
will be assessed. Most of the essay tests given in elementary and
secondary schools and colleges measure nothing more than the ability
to reproduce facts. In order to assess the abilities that are best
measured by essay questions, the questions must be carefully phrased to
require an application or creative synthesis of what has been taught.
Thus, question A tests only information.

Question A 

What rights are guaranteed to the people under the first amendment to
the Constitution?



52 


THE TEACHER'S OWN TESTS 


Question B

A newspaper, The Evening Standard, published a series of articles on
the city government of Townsville. In one article, the reporter for the
paper stated that the mayor of Townsville was incompetent and
inefficient and did not spend enough time in his office to take care of city
affairs. The mayor sued The Evening Standard in court for libel, stating
that the article made him look bad to the people of the city and reduced
his effectiveness as mayor. What decision could be expected from the
courts? Why?


Question B, by contrast, requires identification and selection of the
proper items of information, and their application to the solution of
a new problem. Question B seems more clearly appropriate for an
essay examination.

In the early days of objective testing, some studies were carried 
out that showed that the prospect of an essay examination leads to 
study activities emphasizing the interrelationships of facts and principles
in an area, whereas the prospect of an objective examination
leads to the memorizing of discrete details. There is little recent 
evidence on this point, and we wonder whether the relationship was 
a necessary one or merely a reflection of the low quality of the ob- 
jective tests to which the groups had been exposed. This finding 
certainly points out a potential weakness of objective tests, and one 
escape from this weakness is to use essay tests. We suspect that study 
habits depend less upon the form of the test exercises than upon the 
type of objective that is emphasized — whether the items are objective 
or essay. 

Variants on the Essay Examination. Values claimed for the essay 
examination are those of appraising ability to organize materials and 
to use language effectively to express the resulting organization. How- 
ever, in the usual scheduled essay examinations these functions may 
become submerged because (1) differences in knowledge of the basic
facts hide differences in ability to organize those facts and (2) time 
pressures hide the quality of the individuals’ written expression. 

Two variations may be considered that appear likely to bring out
the factors in which we are particularly interested. One is to give an
“open book” examination, in which every individual has access to any
basic data present in his text, his notes, or other sources. Memory
of facts is then reduced as a factor entering into individual
performance, and ability to locate, select, and use the facts is brought to the
fore.


The second variation is to give the problems as an out-of-class ex- 
amination with unlimited time. This minimizes time pressure, and 



EFFECTIVE USE OF THE ESSAY EXAMINATION 53

makes the test more nearly a pure power test: power both with respect
to organizing ability and with respect to written expression. We do,
of course, introduce a new problem, since we are less able to guarantee
the integrity of the written material turned in. When the examination
is used against rather than for the pupil, illicit help is likely to become
a serious problem.

IMPROVING ESSAY TESTS

We have already pointed out in a previous section that essay tests
can be improved by limiting their use to those objectives that are best
measured by the essay format. There is not much that a teacher can
do to overcome the weakness of essay tests that arises from the limited
number of essay questions that can be presented to students in a given
period of time except to give several essay tests during the school
semester or year. A teacher can do much to overcome some of the
other weaknesses of the essay test by (1) writing good essay questions,
and (2) improving his methods of evaluating the answers. In the
next section we will give some guides to writing better essay questions.
Following that is a section suggesting ways to improve the scoring
of answers to essay questions.

IMPROVING THE CONTENT OF AN ESSAY TEST

The following paragraphs present and discuss several suggestions
for improving the questions that go into an essay test. These are not
scientifically established principles, but they reflect the judgment of
experienced test makers.

1. Before Starting to Write the Essay Question, Have in Mind
Explicitly What Mental Processes of the Student You Want to Bring
Out by the Question. If you want to use the essay question to
determine the extent to which a student can use his information, then
the question must be phrased in such a way that the student must do
such things as solve a problem that has not been directly taught, or
point out relationships that have not been explicitly pointed out before.

2. In General, Start Essay Questions with Such Phrases as “Compare,”
“Contrast,” “Give the reasons for,” “Present the arguments
for and against,” “Give original examples of,” and “Explain how or
why.” These words will help to present tasks requiring the student
to select, organize, and apply his knowledge. Don’t start essay
questions with such words as “what,” “who,” “when,” and “list.”
These words are likely to present tasks requiring only the
reproduction of information.



54 THE TEACHER'S OWN TESTS

3. Write the Essay Question in Such a Way That the Task Is
Clearly and Unambiguously Defined for Each Examinee. A question
such as “Discuss the factors and influences that led to the writing and
adoption of our Constitution” is global, vague, and ambiguous. First,
what does the teacher mean by the word “discuss”? Second, does the
teacher want the student to start with the Magna Charta in 1215 or
with the settlement of the colonies or with the end of the Revolutionary
War? Third, does the teacher want the student to stop with the
beginning of the Constitutional Convention in 1787 or with the
ratification of the Constitution? Fourth, what does the teacher mean by
“factors and influences”? The score that the student receives for his
answer is likely to depend to a large extent on how lucky he is in
guessing what the teacher wanted.

A better way to phrase this question so that each examinee will 
interpret the question in the same way would be: 

Explain how each of the following influenced the provisions written into 
our Constitution by the delegates to the Constitutional Convention. 

A. The Magna Charta, the Petition of Right, and the English Bill of 
Rights. 

B. The fear of tyranny or rule by one man or one group. 

C. The problems that arose in trying to operate under the provisions of
the Articles of Confederation.

D. The fear of the small states that they would be controlled by the
large states.

E. Business rivalries between states. 

The question as it has been rephrased guarantees a more common 
basis for response. In one sense it breaks the one question up into 
five. The analysis also makes clear that on the original question (and
also the revised one) students will require a relatively long time to 
write an adequate answer. 

4. The Words “What do you think,” “In your opinion,” or “Write
all you know about . . .” Almost Never Belong in an Essay Question
to Measure Academic Achievement. The use of these phrases is common
on teacher-made essay tests. But when a teacher asks: “Why
do you think that the Articles of Confederation provided a poor basis
for the formation of our central government?,” he is not really
interested in the student’s opinion. He actually wants to determine whether
the student knows the fundamental weaknesses of the Articles of
Confederation, as stated by the teacher or text. Therefore the question
would be better if written: “Why did the Articles of Confederation
prove to be unworkable as a framework for our national government?”




The only time when the use of “you,” “in your opinion,” or “do
you think” is justified in an essay question (or any other type of test
question) is when the purpose of the question is to obtain an expression
of attitudes (which really cannot be graded) or to determine
how good a logical defense a student can make of the position that
he has taken. In the latter instance, the teacher should not be
particularly interested in which position the student takes and should
evaluate the answer given only on the basis of how well the student
defends or supports his position.

5. Be Sure That the Students Do Not Have Too Many or Too
Lengthy Questions to Answer in the Time Available. An essay test
should not be a test of speed of writing. Good essay questions demand
that the student consider the question, think about his answer,
then write it. These processes take time, and the younger the student
or the more complex the question, the longer is the required time.
In order to answer adequately the revised question on p. 54, the
typical eighth grader would probably need from 45 to 60 minutes. In
most essay tests given in the classroom, three to five such questions
are given to be answered in a single classroom period. This practice
may encourage both sloppy thinking and sloppy writing on the part
of the student.
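
The arithmetic behind this caution can be sketched in a few lines. The sketch is only an illustration: the function name and the 50-minute period are assumptions made for the example, while the 45 to 60 minutes per question is the estimate given above.

```python
def minutes_needed(num_questions, low_per_question=45, high_per_question=60):
    """Return the (low, high) total writing time in minutes."""
    return num_questions * low_per_question, num_questions * high_per_question


period = 50  # assumed length of one classroom period, in minutes

for n in (3, 5):
    low, high = minutes_needed(n)
    print(f"{n} questions: {low} to {high} minutes needed, {period} available")
```

Even at the low estimate, three such questions call for well over two class periods of writing time.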


6. Do Not Use Both Essay and Objective Questions in the Same
Test When the Time for Testing Is Limited. Quite frequently teachers
use both objective and essay questions on the same test. It is not
unusual to see a teacher-made test consisting of thirty to fifty
multiple-choice questions and one to three essay questions, all of which are to
be answered in a 50-minute period. This practice is undesirable first
because there is not enough time for the student to answer adequately
all of the questions and second because there are very difficult
problems in combining the scores on the two different kinds of items. (See
Chapter 17.)

7. Have Each Examinee Answer the Same Questions. Don't Offer
a Choice of Questions to Be Answered. When an essay examination
is being used to appraise achievement of the objectives of a common
program of study, each examinee should be required to answer the
same questions. Giving a choice of questions reduces the common
base upon which different individuals may be compared. It adds one
further source of variability to the subjectivity and inaccuracy that
already exist. A choice of questions may have a public-relations value
with the examinees, but it has no justification from the point of view
of effective measurement.





SCORING ESSAY EXAMINATIONS 

A number of steps may be taken to mitigate the subjectivity and 
reduce some of the biases in evaluating the answers to an essay ex- 
amination. These are mostly attempts to break up the process of 
evaluation into a series of more specific, fractionated judgments made 
upon a common base and applied to an anonymous product. Specific 
suggestions are outlined below. 

1. Decide in Advance What Factors Are to Be Measured. If More
Than One Distinct Quality Is to Be Appraised, Make Separate
Evaluations of Each. If facts are considered important, score for facts. If
organization is important, give a rating upon organization. If
mechanics of English, sentence structure, spelling, punctuation, etc., are
considered a significant outcome, give a rating upon mechanics.
However, do not contaminate the rating for knowledge or understanding
with appraisal of mechanics. It is hard to isolate quality of
organization from extent of factual information, but if the essay question is
to serve its distinctive purpose an attempt should be made to do so.

2. Prepare a Model Answer in Advance, Showing What Points
Should Be Covered and How Many Credits Are to Be Allowed for 
Each. This will provide a common frame of reference for evaluating 
each paper. After the preliminary model has been prepared, it should 
be checked against a sample of student responses to the question. The 
model and the scoring scheme should be modified in the light of these 
answers. They can now be used as the yardstick for assigning credits 
to each paper in turn. 
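
A model answer with credits attached can be written down as a simple table and applied to every paper in the same way. The sketch below is hypothetical throughout: the points listed, their credit values, and the function name are ours, chosen only to illustrate the bookkeeping for a question on the weaknesses of the Articles of Confederation.

```python
# Hypothetical model answer: each point to be covered and its credit value.
MODEL_ANSWER = {
    "no power to tax": 2,
    "no power to regulate commerce": 2,
    "no executive branch": 1,
}


def score_answer(points_covered, model=MODEL_ANSWER):
    """Sum the credits for the model-answer points the reader judged as covered."""
    unknown = set(points_covered) - set(model)
    if unknown:
        # Guard against credit for points that are not in the agreed model.
        raise ValueError(f"not in the model answer: {sorted(unknown)}")
    # A point counts once, however often the student repeats it.
    return sum(model[p] for p in set(points_covered))


print(score_answer(["no power to tax", "no executive branch"]))
```

The reader still has to judge whether a paper covers a point; the table only ensures that the same credits are applied to every paper.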

3. Read All Answers to One Question before Going on to the Next.
A more uniform standard can be maintained for a single question and
for a short period of time. There is more chance to compare one
person’s answer with another’s and thus to build up a “feel” for the
answers. There is less contamination of judgment by what that same
examinee had written on the previous question.

4. Grade the Papers as Nearly Anonymously as Possible. The less 
you know about who wrote an answer, the more objectively you can 
grade what was written. 

5. Greater Reliability Can Be Obtained by Averaging Independent
Ratings. If the importance of the test merits the expenditure of the
extra effort, a more dependable appraisal can be obtained by having
one or more additional raters each give an independent rating of the
responses.
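
The averaging itself is simple bookkeeping, sketched below under assumed names. The rater labels, the paper code numbers, and the function name are hypothetical; code numbers are used in place of pupil names in keeping with the suggestion that papers be graded anonymously.

```python
from statistics import mean


def average_ratings(ratings_by_rater):
    """Average each paper's score across raters.

    ratings_by_rater maps a rater label to a {paper_code: score} dict.
    Only papers rated by every rater are averaged.
    """
    papers = set.intersection(*(set(r) for r in ratings_by_rater.values()))
    return {
        paper: mean(r[paper] for r in ratings_by_rater.values())
        for paper in sorted(papers)
    }


ratings = {
    "rater_1": {"A17": 14, "B02": 9},
    "rater_2": {"A17": 16, "B02": 11},
}
print(average_ratings(ratings))
```

The ratings must be made independently for the average to help; a second rater who has seen the first rater's marks adds little.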




SUMMARY STATEMENT 

Evaluation of pupil achievement is one of the teacher's important
responsibilities. In view of the many functions that tests serve in
motivating and directing learning, and in view of the disservice that
may be done the pupil from poorly conceived or executed evaluation
instruments, it is important that the teacher’s evaluation devices be
well thought out and well made. Both written tests and a variety of
informal appraisals are needed to evaluate completely the objectives
of the modern curriculum.

For any type of written test, it is desirable to have a definite plan
in advance of preparing the test items. The development of such a
plan requires an analysis of the outcomes one is trying to achieve in
the teaching of a particular course or unit and of the significant
segments of content through which those objectives are to be realized.
A statement of objectives useful for guiding the construction of test
items must be phrased in terms of pupil behaviors, specific things
that the pupil is supposed to be able to do, rather than in broad
generalizations. In addition, the plan should include the allocation
of test items among the content areas and objectives, the types of items
to be used, the total number of items in the test, and specifications for
the spread of item difficulties.
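
The allocation of items among content areas and objectives can be worked out mechanically once the teacher has decided how much weight each cell of the plan deserves. The sketch below is one possible way of doing this, not a prescribed procedure: the cell labels, the weights, and the rounding rule (largest remainder) are all assumptions made for the illustration.

```python
def allocate_items(total_items, weights):
    """Split total_items across blueprint cells in proportion to weights.

    Uses largest-remainder rounding so that the counts always sum to
    total_items.
    """
    total_weight = sum(weights.values())
    exact = {cell: total_items * w / total_weight for cell, w in weights.items()}
    counts = {cell: int(share) for cell, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Hand the remaining items to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


# Hypothetical plan: (content area, objective) -> relative weight.
weights = {
    ("Articles of Confederation", "knowledge of facts"): 30,
    ("Articles of Confederation", "application"): 20,
    ("Constitution", "knowledge of facts"): 30,
    ("Constitution", "application"): 20,
}
for cell, count in allocate_items(40, weights).items():
    print(cell, count)
```

The weights remain a matter of professional judgment; the computation only keeps the finished test faithful to them.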

Both essay and objective tests should be used to evaluate pupil 
achievement. The essay test is easier to prepare and has certain ad- 
vantages in appraising ability to recall information, select relevant 
material, and organize it into an integrated answer. However, the 
objective test has marked advantages in freedom from such irrelevant
factors as quality of handwriting or of English usage, in breadth of 
sampling of the desired outcomes of teaching, and in ease and ob- 
jectivity of scoring. 

Essay questions can be improved by phrasing the question so as 
to present a well-defined task to the student and by providing condi- 
tions for scoring that reduce as far as possible the subjectivity of 
grading. 


REFERENCES 

1. Bloom, Benjamin S., Max D. Engelhart, Edward J. Furst, Walker H.
Hill, and David R. Krathwohl, Taxonomy of educational objectives,
the classification of educational goals: handbook 1, cognitive domain,
New York and London, Longmans, Green, 1956.





SUGGESTED ADDITIONAL READING 


Bloom, Benjamin S., Editor, Taxonomy of educational objectives, Handbook
I, Cognitive domain, New York, Longmans, Green, 1956.

Dressel, Paul L., and Lewis B. Mayhew, General education: explorations in
evaluation, Washington, D. C., American Council on Education, 1954.

French, Will, Behavioral goals of general education in high school, New
York, Russell Sage Foundation, 1957.

Harris, Chester W., Editor, Encyclopedia of educational research, 3rd ed.,
New York, Macmillan, 1960, pp. 650-657, 1506-1514.

Kearney, Nolan C., Elementary school objectives, New York, Russell Sage
Foundation, 1953.

Lindquist, E. F., Preliminary considerations in objective test construction,
Chapter 5 in E. F. Lindquist, Editor, Educational measurement,
Washington, D. C., American Council on Education, 1951.

Odell, C. W., How to improve classroom testing, rev. ed., Dubuque, Iowa,
William C. Brown, 1958, Chapters III, IV, V, and VI.

Smith, Eugene R., et al., Appraising and recording student progress, New
York, Harper, 1942, Chapters 1 and 2.

Stalnaker, John M., The essay type of examination, Chapter 13 in E. F.
Lindquist, Editor, Educational measurement, Washington, D. C.,
American Council on Education, 1951.

Thomas, R. Murray, Judging student progress, 2nd ed., New York,
Longmans, Green, 1960, Chapter 2.

Vaughn, K. W., Planning the objective test, Chapter 6 in E. F. Lindquist,
Editor, Educational measurement, Washington, D. C., American
Council on Education, 1951.


QUESTIONS FOR DISCUSSION 

1. Prepare a statement of the objectives for a course, or a unit within
a course, that you are teaching or plan to teach.

2. Which of the objectives in 1 could be measured effectively by a
written test? Which only partially or not at all? Why is a written test
inadequate for these? How might these objectives best be appraised?

3. Based on the objectives identified in the first part of question 2 and
a course outline, prepare a blueprint for a test to evaluate the unit or
course.

4. In a junior high school, one teacher takes complete responsibility for
preparing the common final examination for all the classes in general
science. He makes the examination up without consulting the other
teachers. What advantages and disadvantages do you see in this procedure?

5. It has been stated that one of the goals of the music program in an
elementary school is to “increase the sensitivity of pupils to music in its
different forms.” How could this goal be defined so that progress toward
it could be measured?




6. Students are sometimes heard to remark: “You can’t get a good
mark on Miss X’s tests unless you really know Miss X.” What does this 
remark imply about Miss X’s tests? 

7. On p. 49 is a list of factors that have been presented as favoring 
either essay or objective tests. Do you agree with the classification given
there? Which are the most important factors? What other points should 
be considered in deciding which type of test to use for the final examina- 
tion in a particular course? 

8. Criticize the following features of an essay test planned for a ninth- 
grade social studies class: 

a. There will be 10 questions on the test.

b. Each student will answer any 5.

c. Each question will have a value of 20 points.

d. One point will be taken off for each misspelled word and each
grammatical error.

e. A 5-point bonus will be given for neatness.

f. Time for the test will be 40 minutes.

9. Criticize and revise each of the following essay questions:

a. Discuss the increase in juvenile delinquency since World War II.

b. Discuss government support of farm prices. 

c. Discuss the “cold war.” 

10. For what types of objectives would an open-book essay examination 
be appropriate? What would be the advantages and disadvantages of such 
an examination, as compared with the usual essay examination? 



Chapter 4

Preparing Objective Tests


INTRODUCTION 

The objective type of test item was developed in order to overcome 
some of the disadvantages of the essay test discussed in Chapter 3. 

As we pointed out in that chapter, there is still a good deal of argu-
ment about the relative merits of the two types of test. Those who 
object to the objective type of test say that it emphasizes factual ma- 
terial, encourages piecemeal memorization of unimportant details, 
permits too much guessing of the correct answer, ignores the higher 
mental processes, neglects the more important educational objectives, 
and never gives the student any practice in writing. Except for the 
last objection, we have discussed the other criticisms in Chapter 3. 
As for the last objection, we might well raise the question as to 
whether the testing period is the place to give students practice in 
writing and whether the kind of writing practice provided by most 
essay tests encourages (or discourages) good writing. 

As we have stated before, the question of which kind of test to use
is not an either-or question. Both essay and objective tests can be
used to advantage in the classroom. A poorly constructed test of
either kind can inhibit or misdirect learning. The problem then is
to construct good tests. In this chapter we will consider methods of 
improving and using the objective type of item and of analyzing and 
using the results of objective tests. 

WRITING THE ITEMS FOR AN OBJECTIVE TEST 

Writing good test items is an art. It is a little like writing a good
sonnet and a little like baking a good cake. The operation is not
quite so free and fanciful as writing the sonnet; it is not quite so
standardized as baking the cake. It lies somewhere in between. So
a discussion of item writing lies somewhere between the exhortation
to the poet to go out and express himself and the precise recipes of a
good cookbook. The point we wish to make is that there is no exact
science of test construction. The guides and maxims that we shall
offer are not tested out by controlled scientific experimentation.
Rather, they represent a distillation of practical experience and
professional judgment. As with the recipe in the cookbook, if carefully
followed they yield a good product.

We shall first present some suggestions that apply to almost any
type of objective item. Then we will consider specific item types,
indicating some of the general virtues and limitations of that type of
item and giving more specific suggestions for writing and editing. A
number of the principles that we set forth will seem very obvious.
However, experience in reviewing and editing items indicates that these
most obvious faults are the ones that are most frequently committed
by persons who try to prepare objective tests. Thus, it hardly seems
necessary to insist that a multiple-choice item must have one and only
one right answer, and yet items with no right answer or several occur
again and again in tests that are carelessly prepared.


GENERAL MAXIMS FOR ITEM WRITING 

1. Keep the Reading Difficulty of Test Items Low in relation to the
group who are to take the test, unless the purpose is to measure verbal
and reading abilities. Ordinarily you do not want language difficulties
to interfere with a pupil’s opportunity to show what he knows.


Example 

Poor: What was the ostensible reason for requesting the states to
designate one or more of their constituents as representatives to attend a
general convention to meet in Philadelphia in 1787?

A. To draft a new Constitution. 

B. To raise money to pay off Revolutionary War debts. 

C. To settle commercial disputes among the states. 

D. To revise the Articles of Confederation. 

Better: When the states were asked to send representatives to a general
convention to meet in Philadelphia in 1787, they were told that these
representatives would be asked to

A. draft a new Constitution.

B. raise money to pay off Revolutionary War debts.

C. settle commercial disputes among the states.

D. revise the Articles of Confederation.

2. Do Not Lift a Statement Verbatim from the Textbook. This
places a premium upon rote memory with a minimum of understanding.
Also the statement may have little or no meaning when it is
removed from the context. A statement can at least be paraphrased.
Better still, in many cases it may be possible to imbed the specific
knowledge in an application.


Example 

Poor: T F The House of Representatives shall be composed of
members chosen every second year.

Better: T F A United States Representative elected for a full term
of office to begin in 1961 would end his term in 1963.

3. If an Item Is Based on Opinion or Authority, Indicate Whose
Opinion or What Authority. Ordinarily statements of a controversial 
nature do not make good items, but there are instances where know- 
ing what some particular person thinks may be important for its own 
sake. The student should presumably be acquainted with the view- 
point of his textbook or instructor, but he should not be placed in the 
position of having to endorse it as indisputable fact. 

Example 

Poor: T F The Declaration of Independence influenced later
political developments more than any other document.

Better: T F According to your textbook, the Declaration of
Independence influenced later political developments more than any other
document.

4. In Planning a Set of Items for a Test, Take Care That One Item
Does Not Provide Cues to the Answer of Another Item or Items. The
second item below gives cues to the first.

Example 

1. Under the provisions of the Constitution, the judicial branch of our 
National Government is given the power to 

A. enforce the laws. 

B. interpret the laws.

C. make the laws.

D. repeal the laws.

2. The interpretation of laws by the judicial branch of our National
Government has been one method used to

A. keep the powers of government in the hands of the people. 

B. prevent the President from being dominated by a strong Congress.

C. guarantee Constitutional rights to all citizens.

D. make the Constitution flexible enough to meet changing social,
political, and economic conditions.




5. Avoid the Use of Interlocking or Interdependent Items. The
answer to one item should not be required as a condition for solving
the next item. This is the other side of the principle stated in 4 above.
Every individual should have a fair chance at each item as it comes.
Thus, in the example shown below, the person who does not know
the answer to the first question is in a very weak position as far as
attacking the second one is concerned.


Example 

1. The name of the first written constitution in the American colonies
was the (Fundamental Orders of Connecticut).

2. This constitution was drafted in the year (1639).

6. In a Set of Items, Let the Occurrence of Correct Responses
Follow Essentially a Random Pattern. Avoid favoring certain responses,
i.e., either true or false, or certain locations in a set of responses. Do
not have the responses follow any systematic pattern.
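
One way to guard against a systematic pattern is to let chance decide where the correct response falls and then tally the finished key. The sketch below is an illustration only: the function names are ours, four options per item are assumed, and the item is borrowed from the example under maxim 1.

```python
import random
from collections import Counter

LETTERS = "ABCD"  # assumes four options per item


def shuffle_item(correct, distractors, rng):
    """Return (options, key_letter) with the correct response placed at random."""
    options = [correct] + list(distractors)
    rng.shuffle(options)
    return options, LETTERS[options.index(correct)]


def key_distribution(keys):
    """Count how often each letter is the keyed answer across a test."""
    return Counter(keys)


rng = random.Random()
options, key = shuffle_item(
    "revise the Articles of Confederation.",
    [
        "draft a new Constitution.",
        "raise money to pay off Revolutionary War debts.",
        "settle commercial disputes among the states.",
    ],
    rng,
)
for letter, text in zip(LETTERS, options):
    print(f"{letter}. {text}")
print("Keyed answer:", key)
```

Running `key_distribution` over the keys of the whole test shows at a glance whether one position has been favored.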

7. Avoid Trick and Catch Questions, except in the rare case in 
which the test has a specific purpose of measuring ability to keep out 
of traps. Trick questions are likely to mislead the abler or better- 
informed student, who knows enough to be caught by the trap. If they 
do this, they defeat the basic purpose of the test, which is to identify 
levels of knowledge and understanding. 

Example 1

Poor: T F The term of office for all senators is 6 years.

(The item is keyed true but the student who knows the most about
government is likely to get it wrong because a senator who is elected or
appointed to take the place of a senator who dies serves only the unexpired
time. Also, among the first group of senators at the time the Constitution
was adopted, some served only 2 years, some 4, and some 6.)

Better: T F The Constitution states that the term of office for
senators shall be six years.


Example 2 


Poor: T F On May 25, 1787, fifty-five delegates from twelve states
met to revise the Articles of Confederation.

(This was keyed false because all fifty-five were not present on May
25. No revision is shown for this item because the idea being tested is
considered so insignificant that it would be better not to use the item.)


8. Try to Avoid Ambiguity of Statement and Meaning. This is a
general admonition, somewhat like “Sin no more,” and it may be no
more effective. However, it is certainly true that ambiguity of
statement and meaning is the most pervasive fault in objective test items.
Many of the specific points already covered and many of those still to
be covered deal with specific aspects of the reduction of ambiguity.


Example 

Poor: In the Constitution, the composition of Congress was established
in order to 


A. maintain balance of power between the large and small states.

B. protect the interests of propertied classes.

C. get the delegates to accept and sign the Constitution.

D. provide for a stronger central government. 

The keyed answer to the above question was A, but the examinee trying
to answer the item is faced with several problems. First of all, what does
the writer of the item mean by “the composition of Congress”? Does he
mean the division of Congress into two houses, the basis for determining
representation in Congress, or the qualifications of the members of
Congress? Does the writer of the item want the student to give the immediate
reason for the compromise or the ultimate reason? Actually the writer of
this item was trying to determine whether the student knew why the
Constitution provided for a Congress made up of a House of Representatives
with proportional representation from each state and a Senate with equal
representation from each state.

But even if the student guesses correctly what the item writer had in
mind when he wrote the item, he is likely to have difficulty with the answer 
choices. 

A case can be made for each of the answer choices being correct. All
of the compromises at the Constitutional Convention had two aims: to
provide for a stronger central government and, at the same time, to draft
a document that the states would be willing to accept. There is some truth
in choice B because the large states feared that the small states would pass
laws interfering with business and property. Of all the answer choices,
the keyed answer A is probably the least correct since the purpose was not
to maintain exact balances of powers between large and small states but
to grant some concessions to each.

The item needs to be sharpened up in several respects. The example 
below would appear to test the same knowledge and to provide less occa- 
sion for misunderstanding of what the examiner was trying to say. 


Better: At the Constitutional Convention, the delegates agreed to give
each state equal representation in the Senate and proportional
representation in the House of Representatives in order to

A. satisfy the conflicting demands of the large and small states.

B. protect the rights of the sovereign states.

C. make the legislative branch of the central government the strongest.

D. keep the government in the hands of all the people.




9. Beware of Items Dealing with Trivia. An item on a test should
appraise some important item of knowledge or some significant
understanding. Avoid the type of item that could quite justifiably be
answered, “Who cares?” Ask yourself in each case whether knowing
or not knowing the answer would make a significant difference in the
individual’s competence in the area being appraised.


Example

Poor: A census every 10 years was provided for in the Constitution in
Article I Section

A. 1

B. 2

C. 3

D. 4

Better: The reason the framers of the Constitution provided that a
national census should be taken every 10 years was to

A. obtain information needed by Congress to carry out its duties.

B. determine how many Representatives each state should have.

C. determine how rapidly the country was growing.

D. obtain accurate information for use by government and industrial
agencies.

TRUE-FALSE ITEMS

The true-false item has had a popularity in teacher-made objective
tests far beyond that warranted by its essential nature. This has
probably happened because bad true-false items can be written quickly and
easily. To write good ones is quite a different matter.

Even when they are well written, true-false items are relatively
restricted in the types of educational objective they can measure. They
should be limited to statements that are unequivocally true or
demonstrably false. For this reason, they are adapted to measuring relatively
specific, isolated, and often trivial facts. They can also be used fairly
well to test meanings and definitions of terms. But items testing
genuine understandings, inferences, and applications are usually very hard
to cast in true-false form. The true-false item is particularly open to
attack as fostering piecemeal, fractionated, superficial learning and is
probably responsible for many of the attacks upon the objective test.
It is also in this form of test that the problem of guessing becomes
most acute.

The commonest variety of true-false item presents a simple
declarative statement, and requires of the examinee only that he indicate
whether it is true or false.





Example 

T F The Articles of Confederation provided for a strong central
government.

Several variations have been introduced in an attempt to improve 
the item type. One simple variation is to underline a part of the 
statement, viz., “strong” in the above example. The instructions indi- 
cate that this is the key part of the statement and that it determines 
whether the statement is true or false. That is, the correctness or 
appropriateness of the rest of the statement is guaranteed. The exam- 
inee can focus his attention upon the more specific issue of whether 
the underlined part is compatible with the rest of the statement. This
seems to reduce guessing and make for more consistent measurement. 

A further variation is to require the examinee to correct the item
if it is false. This works well if combined with the underlining
described above but is likely to be confusing if no constraints are
introduced in the situation. Our example could be corrected by changing
“Articles of Confederation” to “Constitution,” by changing “strong”
to “weak,” or by changing “central” to “state.” Requiring that the
item be corrected reduces guessing and provides some further cue to
the individual’s knowledge.

Generally, the true-false type of item tends to be most useful when
it is based on some given stimulus material such as a chart, map, 
graph, table, or reading passage and when the student responds to the 
item only in terms of the given material. This type of true-false item 
has been used effectively in testing ability to interpret data of differ- 
ent kinds. However, in this case, the format is generally changed by 
requiring the student to answer in four or five categories such as
definitely true, probably true, insufficient data to determine whether it is
true or false, probably false, and definitely false. In this format the
item is more like a multiple-choice item than a true-false item. 

CAUTIONS IN WRITING TRUE-FALSE ITEMS

1. Be Sure That the Item as Written Can Be Unequivocally
Classified as Either True or False. One of the most common weaknesses
in true-false items is that the person who knows the most about the
content may find it difficult to judge whether the item is true or false.
This is particularly likely to happen with items that were intended to
be true statements. The student who knows the most about the
content can often think of a number of exceptions or reasons why the
statement is not universally true. Consider the following example.





Example 

Poor: T F The presidential candidate who receives the majority
of votes is elected President.

The item was keyed true but strictly speaking it is not true. The
candidate must receive the majority of electoral votes but not necessarily the
majority of the popular vote. It is the higher-achieving student who is
likely to know about both the electoral votes and the popular vote and he
is likely to mark the item false because it does not specify electoral votes.
The item would be better if it were revised as follows:

Example

Better: T F The presidential candidate receiving a majority of the
electoral votes is elected President.

2. Beware of “Specific Determiners,” words that give cues to the
probable answer, such as all, never, usually, etc. Statements that
contain “all,” “always,” “no,” “never,” and such all-inclusive terms
represent such broad generalizations that they are likely to be false.
Qualified statements involving such terms as “usually” or “sometimes”
are likely to be true. The test-wise student knows this, and
will use these cues, if he is given a chance, to get credit for
knowledge he does not possess. “All” or “no” may sometimes be used to
advantage in true statements, because in this case using the determiner
as a cue will lead the examinee astray.

Example 

Poor: T F All sessions of Congress are called by the President. 

Better: T F All persons elected to the House of Representatives 
must be at least 25 years old. 
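
A draft test can be screened for specific determiners before it is given. The sketch below is a rough aid and no substitute for careful editing: the word lists contain only the terms named above, and the function name is our own.

```python
import re

# Terms named in the text; the lists can be extended as needed.
LIKELY_FALSE_CUES = {"all", "always", "no", "never"}
LIKELY_TRUE_CUES = {"usually", "sometimes"}


def find_determiners(statement):
    """Return the specific determiners that occur as whole words in a statement."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    return sorted(words & (LIKELY_FALSE_CUES | LIKELY_TRUE_CUES))


for item in (
    "All sessions of Congress are called by the President.",
    "The Supreme Court has the power to declare a law unconstitutional.",
):
    print(find_determiners(item), item)
```

A flagged item is not necessarily a poor one. As noted above, "all" in a true statement works against the test-wise guesser, so each flagged item should be read against its keyed answer.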

3. Beware of Ambiguous and Indefinite Terms of Degree or Amount.
Expressions such as “frequently,” “greatly,” “to a considerable
degree,” and “in most cases” are not interpreted in the same way by
everyone who reads them. Ask a class or other group what they
think of when you say that something happens “frequently.” Is it
once a week or once an hour? Is it 90 per cent of the time or 50 per
cent? The variation will be very great. (Ed.: How great is very
great?) An item in which the answer depends on the interpretation
of such terms as these is an unsatisfactory one.

Example 

Poor: T F The Supreme Court is frequently required to rule on
the constitutionality of a law.

Better: T F The Supreme Court has the power to declare a law
unconstitutional.




4. Beware of Negative Statements and Particularly of Double
Negatives. The negative is likely to be overlooked in hurried reading of
an item, and the double negative is hard to read and confusing.


Example 

Poor: T F The Constitution does not provide that no state law can
deny a citizen the right to vote.

Better: T F The Constitution grants to each state the right to make
laws specifying the qualifications for voting in that state.

5. Beware of Items that Include More than One Idea in the
Statement, Especially If One Is True and the Other Is False. This type of
item borders on the category of trick items. It places a premium on
care and alertness in reading. The reader must not restrict his
attention to one idea to the exclusion of the other or he will be misled.
The item tends to be a measure of reading skills rather than
knowledge or understanding of subject content.

Examples 

Poor: T F The President has the power to make treaties with
foreign countries, but the Senate must approve them by a majority of votes.

Better: T F The Senate must approve a treaty with a foreign
country by a majority of votes.


Poor: T F No person shall be elected to the office of president
more than twice, but a person who has acted as president for 2 years or
more shall be eligible for re-election for at least two full terms.

Better: T F A person who has acted as president for 2 or more
years can be re-elected twice.

(In each of the poor items, the first statement is true and the second 
one is false.) 

6. Beware of Items Where the Correct Answer Depends upon One 
Insignificant Word, Phrase, or Letter. Each test item should measure 
an important aspect of the student's achievement; therefore each 
true-false item should require the student to react to important ideas 
and should not require him to be a proofreader. Many teachers try to 
obtain a spread of scores on a test by introducing items that require 
the student to examine each word and each letter in the word in order 
to arrive at the correct answer. For example, the item, “Ulysses Sampson 
Grant was President of the United States from 1869 to 1877,” 
appeared as a true-false item on a sixth-grade social studies test and was 
keyed false because Grant’s middle name was Simpson, not Sampson. 
Surely, knowing Grant’s middle name is not a significant aspect of 
achievement in sixth-grade social studies; however, if it is, then the 
item should be written so that attention is drawn to the middle name 
of Grant; e.g., “T F Ulysses Grant's middle name was Sampson.” 

7. Beware of Giving Cues to the Correct Answer by the Length of 
the Item. There is a general tendency for true statements to be longer 
than false ones. This is a result of the necessity of including 
qualifications and limitations to make the statement true. The item writer 
must be aware of this trend and make a conscious effort to overcome 
it. 
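
The length cue described in maxim 7 can be checked roughly on a draft 
test. The Python sketch below is only an illustration and is not a 
procedure given in the text. It compares the average number of words in 
statements keyed true with the average for statements keyed false. The 
function name mean_words_by_key and the four-item pool are hypothetical; 
the statements themselves are taken from the examples in this section. 

```python
from statistics import mean


def mean_words_by_key(items):
    """Return the average word count of statements keyed True and keyed False.

    `items` is a list of (statement, keyed_true) pairs. This is a
    hypothetical helper for illustration, not part of any testing library.
    """
    lengths = {True: [], False: []}
    for statement, keyed_true in items:
        # Count whitespace-separated words as a crude measure of length.
        lengths[keyed_true].append(len(statement.split()))
    # Report only the keys that actually occur in the pool.
    return {key: mean(counts) for key, counts in lengths.items() if counts}


# Hypothetical draft pool built from statements used earlier in the section.
items = [
    ("All persons elected to the House of Representatives must be "
     "at least 25 years old.", True),
    ("The Supreme Court has the power to declare a law unconstitutional.", True),
    ("The Senate must approve a treaty with a foreign country by a "
     "majority of votes.", False),
    ("Ulysses Grant's middle name was Sampson.", False),
]

averages = mean_words_by_key(items)
print("Average words, keyed true: ", averages[True])
print("Average words, keyed false:", averages[False])
```

A marked gap between the two averages suggests that length may be 
acting as a cue and that some statements should be reworded. With so 
few items the comparison is only suggestive. 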

SHORT-ANSWER AND COMPLETION ITEMS 

The short-answer and the completion item tend to be very nearly 
the same thing, differing only in the form in which the problem is 
presented. If it is presented as a question it is a short-answer item, 
whereas if it is presented as an incomplete statement it is a 
completion item. 


Example 

Short Answer: In what colony was the first representative assembly in 
America established? 

Completion: The first representative assembly in America was 
established in the colony of (Virginia) . 

Items of this type are well suited to testing knowledge of vocabulary, 
names or dates, identification of concepts, and ability to solve 
algebraic or numerical problems. Numerical problems that yield a 
specific numerical solution are “short answer” in their very nature. The 
measurement of more complex understandings and applications is 
difficult to accomplish with items of this type. Furthermore, 
evaluation of the varied responses that are given is likely to call for some 
skill and to introduce some subjectivity into the scoring procedure. 

MAXIMS CONCERNING COMPLETION ITEMS 

1. Beware of Indefinite or “Open” Completion Items. In the first 
example, on p. 70, there are many words or phrases that give factually 
correct and reasonably sensible completions to the statement, i.e., 
“arrested,” “imprisoned,” “acquitted,” “critical of the government,” 
“from New York,” “a publisher.” The problem needs to be more 
fully defined, as is done in the revised statement. 





Example 

Poor: The man whose case won freedom of the press for our country 
was (Zenger) . 

Better: The name of the man whose case won freedom of the press for 
our country was (Zenger) . 

2. Omit Only Key Words. Do not leave the verb out of a com- 
pletion statement unless the purpose of the item is to measure knowl- 
edge of verb forms. The blank in a completion item should require 
the student to supply an important fact. 


Example 

Poor: The Constitutional Convention (met) in Philadelphia in 1787. 
Better: The Constitutional Convention met in Philadelphia in the year 
(1787) . 

3. Don’t Leave Too Many Blanks in a Statement. Overmutilation 
of a statement reduces the task of the examinee to a guessing game or 
an intelligence test. 

Example 

Poor: The (Ordinance) of (1787) provided for the (admission) 
of (new states) . 

Better: The procedure for admitting new states to the Union was first 
set forth by the (Ordinance of 1787) . 

4. Blanks are Better Put Near the End of a Statement Rather Than 
at the Beginning. This permits the problem to be stated before the 
blank is encountered. 

Example 

Poor: A(n) (tariff) is a tax on goods imported into a country. 
Better: A tax levied on goods imported into a country is called a(n) 
(tariff) . 


5. If the Problem Requires a Numerical Answer, Indicate the Units 
in Which It Is to Be Expressed. This will simplify the problem of 
scoring and will remove one possibility of ambiguity in the examinee’s 
response. 

MULTIPLE-CHOICE ITEMS 

The multiple-choice item is the most flexible and most effective of 
the objective item types. It is effective for measuring information, 
vocabulary, understandings, application of principles, or ability to 
interpret data. In fact, it can be used to test practically any 
educational objective that can be measured by a pencil-and-paper test except 
the ability to organize and present material. The versatility and 
effectiveness of the multiple-choice item are limited only by the ingenuity 
and talent of the item writer. 

The multiple-choice item consists of two parts: the stem, which 
presents the problem, and the list of possible answers or options. The 
stem may be presented in the form of an incomplete statement or a 
question. 


Example 

Incomplete statement: If both the President and Vice-President died in 
office, the person who would act as President would be the 

A. Majority Leader of the Senate. 

B. President of the Senate. 

C. Speaker of the House of Representatives. 

D. Secretary of State. 

Question: Who would act as President if both the President and 
Vice-President died in office? 

A. The Majority Leader of the Senate. 

B. The President of the Senate. 

C. The Speaker of the House of Representatives. 

D. The Secretary of State. 

Inexperienced item writers usually find it easier to use the question 
form of stem than the incomplete sentence form. The use of the 
question forces the item writer to state the problem explicitly. It rules 
out certain types of faults that may creep into the incomplete 
statement, which we will consider presently. However, the incomplete 
statement is often more concise and pointed than the question, if it 
is skillfully used. 

The number of options used in the multiple-choice question differs 
in different tests, and there is no real reason why it cannot vary for 
items in the same test. However, to reduce the guessing factor, it is 
preferable to have four or five options for each item. On the other 
hand, it seems more sensible to have only three good options for an 
item than to have five, two of which are so obviously wrong that no 
one ever chooses them. 
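
The arithmetic behind the “guessing factor” can be sketched as follows. 
This Python fragment is an illustration under a simplifying assumption 
that the text does not state: an examinee who does not know the answer 
guesses blindly among the options that look plausible and ignores any 
that are obviously wrong. The function names and the 40-item test are 
hypothetical. 

```python
from fractions import Fraction


def chance_of_guessing(n_options, n_implausible=0):
    """Probability of a correct blind guess on one item.

    Assumes the examinee discards `n_implausible` obviously wrong options
    and picks at random among the rest (an illustrative assumption).
    """
    functioning = n_options - n_implausible
    if functioning < 1:
        raise ValueError("at least one option must remain")
    return Fraction(1, functioning)


def expected_chance_score(n_items, n_options, n_implausible=0):
    """Expected number of items answered correctly by guessing alone."""
    return n_items * chance_of_guessing(n_options, n_implausible)


# Hypothetical 40-item test answered entirely by blind guessing.
for k in (2, 3, 4, 5):
    print(k, "options:", float(expected_chance_score(40, k)), "items by chance")

# Five options, two of which nobody chooses, behave like three options.
print(chance_of_guessing(5, n_implausible=2) == chance_of_guessing(3))
```

Under this assumption, adding options that no one chooses does not 
lower the chance score, which is the point of preferring three good 
options to five of which two are obviously wrong. 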

The difficulty of a multiple-choice item will depend upon the 
“closeness” of the options and the process called for in the item. 
Consider the set of three items shown below, all relating to the First 
Amendment to the Constitution. We can predict with some 
confidence that version I will be passed by more pupils than will II, and 
II by more than III. The difference between I and II is in the 
closeness of the options: in I, the wrong choices fall completely outside 
the Bill of Rights, i.e., the first ten Amendments to the Constitution, 
while in II, each option refers to some one of these Amendments. The 
difference between II and III is primarily a matter of the intellectual 
process involved: II requires little more than remembering and 
recognizing the key concept involved in the different amendments, while III 
requires that the student identify that concept when it is embedded 
in a specific concrete situation. 


Version I 

The First Amendment to the Constitution is concerned with 

A. powers of Congress. 

B. the abolition of slavery. 

C. freedom of speech, press, and religion. 

D. the term of office of the President. 

Version II 

According to the First Amendment to the Constitution, the government 
is permitted to 

A. search a person’s house without a warrant. 

B. hold a person in jail for a long time without a trial. 

C. make laws that interfere with freedom of speech or religion. 

D. force a person to give evidence against himself. 


Version III 

Which of the following actions would violate the rights guaranteed to 
a person by the First Amendment to the Constitution? 

A. An F.B.I. agent gets a tip that counterfeiters are operating in Mr. 
Jones' basement, so he breaks in the door to search the basement. 

B. Mr. Smith is arrested and held in jail for three weeks but is not 
informed of the charges against him and is not allowed to see a 
lawyer. 

C. Mr. Simpson is arrested for writing articles criticizing the 
government's defense policies. 

D. Mr. Hoffman, who is on trial for conspiracy, is forced to take the 
witness stand and give evidence. 




MAXIMS FOR MULTIPLE-CHOICE ITEMS 

1. The Stem of a Multiple-Choice Item Should Clearly Formulate a 
Problem. All the options should be possible answers to a single 
problem that is raised by the stem. When the stem is phrased as a 
question, it is clear that a single problem has been raised, but this should 
be equally the case when the stem is in the form of an incomplete 
statement. Avoid items that are really a series of unrelated true-false 
items dealing with the same general topic. 


Example 

Poor: At the Constitutional Convention, the “great” compromise 

A. gave small and large states equal representation in the Senate. 

B. made slave holding legal. 

C. was opposed by Washington. 

D. gave the western lands claimed by the states to the federal 
government. 

Better: At the Constitutional Convention, the “great” compromise 
between the large and small states was concerned with 

A. representation in Congress. 

B. importation of slaves. 

C. the power to levy taxes. 

D. commerce between states. 

2. Include as Much of the Item as Possible in the Stem. In the 
interests of economy of space, economy of reading time, and clear 
statement of the problem, it is usually desirable to try to word and 
arrange the item so that the stem is relatively long and the several 
options relatively short. This cannot always be achieved but is an 
objective to be worked toward. This principle ties in with the one 
previously stated of formulating the problem fully in the stem. 


Example 


Poor: According to the Constitution, neither Congress nor the states 
can pass a law 

A. that would require a citizen to be able to read and write before he 
could vote. 

B. that would prevent a citizen from voting because he did not own 
property. 

C. that would make it impossible for a citizen to vote because he had 
committed a crime. 

D. that would deprive a citizen of the right to vote because he was of 
Chinese descent. 

Better: According to the Constitution, neither Congress nor the states 
can pass a law that would deprive a citizen of his right to vote because he 

A. could not read or write. 

B. did not own property. 

C. had committed a crime. 

D. was of Chinese descent. 

3. Don’t Load the Stem Down with Irrelevant Material. In certain 
special cases, the purpose of an item may be to test the examinee's 
ability to identify and pick out the essential facts. In this case, it is 
appropriate to hide the crucial aspect of the problem in a set of 
details that are of no importance. Except for this case, however, the 
item should be written so as to make the nature of the problem posed 
as clear as possible. The less irrelevant reading the examinee has to 
do, the better. 


Example 

Poor: The framers of the Constitution faced many problems. The 
delegates to the Constitutional Convention represented states with different 
interests, and the delegates from the individual states wanted to see that 
their states’ interests were protected. However, the delegates agreed that 
the Articles of Confederation needed to be changed in order to provide for 

A. a President whom everyone could respect. 

B. a stronger central government. 

C. a better understanding between states. 

D. a government for and by the people. 

Better: The delegates to the Constitutional Convention agreed that the 
Articles of Confederation needed to be changed in order to provide for 

A. a President whom everyone could respect. 

B. a stronger central government. 

C. a better understanding between states. 

D. a government for and by the people. 

4. Be Sure that There Is One and Only One Correct or Clearly Best 
Answer. It hardly seems necessary to specify that a multiple-choice 
item must have one and only one right answer, but in practice this is 
one of the most pervasive and insidious faults in item writing. Thus, 
in the following example, though choice A was probably designed to 
be the correct answer, there is a large element of correctness also in 
choices B and D. The item could be improved as shown in the revised 
form. 





Example 

Poor: The adoption of the Constitution was generally opposed by 
people who 

A. owed money. 

B. thought that most of the people were unfit to govern themselves. 

C. owned businesses. 

D. thought the states would be destroyed. 

Better: The adoption of the Constitution was generally opposed by 
people who 

A. owed money. 

B. were engaged in commerce. 

C. owned western land. 

D. were engaged in manufacturing. 

5. Items Designed to Measure Understandings, Insights, or Ability 
to Apply Principles Should Be Presented in Novel Terms. If the 
situations used to measure understandings follow very closely the examples 
used in text or class, the possibility of a correct answer being based 
on rote memory of what was read or heard is very real. The second 
and third variations of the example on p. 72 illustrate an attempt to 
move away from the form in which the concept was originally stated. 

6. Beware of Clang Associations. If the stem and the keyed answer 
“sound alike,” the examinee may get the question right just by using 
this superficial cue. However, superficial associations in the wrong 
answers represent one of the effective devices for attracting those who 
do not really know the fact or concept being tested. This last practice 
must be used with discretion, or one may prepare trick questions. 


Example 

Poor: A system of checks and balances was established by the 
Constitution in order to 

A. balance majority power and minority rights. 

B. appease the small states. 

C. distribute powers between the central government and the state 
governments. 

D. provide for flexibility in the central government. 





Better: A system of checks and balances was established by the 
Constitution in order to 

A. prevent one group or one person from seizing the power of 
government. 

B. balance the powers of the small and large states. 

C. distribute powers equally between the central government and the 
state governments. 

D. provide for flexibility in the Constitution. 


7. Beware of Irrelevant Grammatical Cues. Be sure that each 
option is a grammatically correct completion of the stem. Cues from 
the use of the indefinite article (“a” versus “an”) in the stem, the 
number or tense of a verb, the use of the plural form of a noun or 
pronoun, etc., must be excluded. 


Example 

Poor: A power of the federal government that is suggested by the Con- 
stitution but is not directly stated in the Constitution is called an 

A. concurrent power. 

B. residual power. 

C. implied power. 

D. delegated power. 

Better: A power of the federal government that is suggested by the Con- 
stitution but is not directly stated in the Constitution is called 

A. an executive power. 

B. a concurrent power. 

C. an implied power. 

D. a residual power. 

(Note that one option was changed to provide for two options that used 
“an” since test-wise examinees sometimes use the one article that is 
different as a cue to the correct answer.) 

8. Beware of the Use of One Pair of Opposites as Options If One 
of the Pair is the Correct or Best Answer. The directions for a mul- 
tiple-choice test usually instruct the examinee to choose the one correct 
or best answer. If only one pair of opposites is used as options and 
one of the pair is the correct answer, the examinee is likely to limit 
his choice of answers to these two options because he thinks that both 
of them cannot be wrong. When this happens, the item is likely to 
operate as a two-choice item rather than as a four- or five-choice item, 
and the probability of guessing the correct answer is increased. It is 
better, if possible, to use two pairs of opposites or to eliminate the use 
of opposites. 





Example 

Poor: The chief objective of Daniel Shays' Rebellion was to force the 
state of Massachusetts to 

A. grant ex-soldiers the right to vote. 

B. issue paper currency. 

C. withdraw paper currency. 

D. stop slave trading. 

Better: The chief objective of Daniel Shays' Rebellion was to force the 
state of Massachusetts to 

A. grant ex-soldiers the right to vote. 

B. issue paper currency. 

C. pay ex-soldiers for their services in the Revolutionary War. 

D. stop slave trading. 

9. Beware of the Use of “None of These,” “None of the Above,” 
“All of These,” and “All of the Above” as Options. Except for items 
requiring numerical computation the option “None of these” or “None 
of the above” usually fails to make any sense since it contradicts the 
stem or does not complete the stem grammatically. As a rule both 
options tend to be used as fillers, i.e., when the item writer cannot 
think of a fourth or fifth answer choice, he sticks in “None of these” 
or “All of these,” usually as an incorrect answer. 

The use of the option “All of these” as a correct answer in a four- 
or five-choice item generally makes an item less discriminating because 
if the examinee knows that at least two of the answer choices are 
correct he automatically gets the correct answer whether he knows 
anything about the other options or not. 

If “None of these” or “All of these” is used as an answer choice, 
it should be used as frequently for the correct choice as are any of 
the other options. 

Examples 

Poor: The Federalist Papers were written by 

A. Hamilton. 

B. Jay. 

C. Madison. 

D. All of the above. 

Better: The Federalist Papers were written by 

A. Hamilton, Jay, and Madison. 

B. Hamilton, Jefferson, and Madison. 

C. Jefferson, Washington, and Franklin. 

D. Washington, Franklin, and Jay. 




Poor: Under the Articles of Confederation the national government ob- 
tained money to run the government by 

A. putting a tax on imports. 

B. printing additional paper currency. 

C. borrowing money from foreign governments. 

D. none of the above. 

Better: Under the Articles of Confederation the national government 
obtained money to run the government by 

A. putting a tax on imports. 

B. printing additional paper currency. 

C. borrowing money from foreign governments. 

D. taxing property. 

10. Use the Negative Only Sparingly in the Stem of an Item. It is 
usually desirable to emphasize the positive aspects of knowledge rather 
than the negative aspects of knowledge. However, there are times 
when it is important for the student to know the exception or to be 
able to detect errors. For these purposes a few items with the words 
“not” or “except” in the stem may be justified, particularly when 
overinclusion is a common error for students. When a negative word is 
used in the stem of an item, it should be underlined and/or capitalized 
to call the student’s attention to it. 

Example 

Poor: Which one of the following leaders of the Revolutionary War did 
not want the Articles of Confederation changed? 

A. Benjamin Franklin 

B. George Washington 

C. Alexander Hamilton 

D. Patrick Henry 

(Note this is a poor use of the negative stem because it could be stated 
more effectively in positive form, “Which one of the following leaders of 
the Revolutionary War favored keeping the Articles of Confederation?”) 

Better: According to the Constitution, the President does NOT have the 
power to 

A. declare war. 

B. pardon a person convicted by a federal court. 

C. call a special session of Congress. 

D. nominate judges for the Supreme Court. 

(Note this is a better use of the negative stem because (1) it requires 
the student to detect a common error made about the powers of the 
President; and (2) it would be difficult to get three good misleads if the item 
were stated in positive form. The stem of the item could not be stated 
“The Constitution forbids the President to” because the 
Constitution does not specifically forbid the President to declare war.) 




THE MATCHING ITEM 

The matching item is actually a special form of the multiple-choice 
item. The characteristic that distinguishes it from the ordinary 
multiple-choice item is that instead of a single problem or stem with a 
group of suggested answers, there are several problems whose answers 
must be drawn from a single list of possible answers. 

The matching item has most frequently been used to measure fac- 
tual information such as the meaning of words, dates of events, asso- 
ciation of authors with titles of books or titles with plot or characters, 
names associated with particular events, or association of chemical 
symbols with names of chemicals. The matching item is a compact 
and efficient way of measuring this type of achievement. 

Effective matching items may often be built by basing the set of 
items upon a graph, chart, map, diagram, or picture of equipment. 
Features of the figure may be labeled, and the examinee may be asked 
to match names, functions, etc., with the labels on the figure. This 
type of item is particularly useful in tests dealing with science or tech- 
nology, e.g., identification of organs in an anatomy test. 

However, there are many topics to which the matching item is not 
very well adapted. The items making up a set should bear some re- 
lationship to each other; that is, they should be homogeneous. In 
the case of many of the outcomes one would like to test, it is difficult 
to get enough homogeneous items to make up a set for a matching 
item. 

Consider the example that appears below. 

Instructions: Match the statements in Column I with those in Column II. 

Column I                                  Column II 

1. First Ten Amendments                   A. 1215 
2. We owe much of our democratic          B. George Washington 
   heritage to this country.              C. Authors of the Federalist 
3. Date of the Magna Charta                  Papers 
4. Jay, Hamilton, and Madison             D. England 
5. Chairman of the Constitutional         E. Bill of Rights 
   Convention 


This example illustrates most of the common mistakes made in 
preparing matching items. First, the directions are vague because 
they do not specify either the basis for matching or how the examinee 
is to record his answers. Second, the statements in Column I have 
nothing in common except that all of them refer to materials usually 
included in an eighth-grade unit on the Constitution. Look at 
statement 3 in Column I, which asks for the date of the Magna Charta. 
Column II includes only one date. Successful matching requires 
no knowledge on the part of the student. Each item in the set can 
be matched in the same way, using only the most superficial cues. 
Third, note the number of answer choices provided in Column II to 
match with the five statements in Column I. If the directions 
indicate that each answer is to be used only once, then the person who 
knows four of the answers automatically gets the fifth by elimination, 
and the person who knows three has a fifty-fifty chance on the last 
two. 
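
The effect of elimination in a matching set can be worked out 
numerically. The Python sketch below is an illustration, not a formula 
given in the text. It assumes that each answer choice may be used only 
once and that the examinee assigns the leftover choices to the leftover 
statements blindly. The function name chance_all_remaining_correct is 
hypothetical. 

```python
from fractions import Fraction
from math import perm


def chance_all_remaining_correct(n_items, n_choices, n_known):
    """Probability of matching every unknown statement correctly by chance.

    Assumes each answer choice is used at most once and that the examinee
    hands out the remaining choices at random (illustrative assumption).
    """
    if not 0 <= n_known <= n_items <= n_choices:
        raise ValueError("need 0 <= n_known <= n_items <= n_choices")
    unknown_items = n_items - n_known
    remaining_choices = n_choices - n_known
    # Number of equally likely ways to assign distinct choices to the unknowns.
    ways = perm(remaining_choices, unknown_items)
    return Fraction(1, ways)


# Five statements and five choices, as in the faulty example.
print(chance_all_remaining_correct(5, 5, n_known=4))  # fifth answer is forced
print(chance_all_remaining_correct(5, 5, n_known=3))  # even chance on the last two

# Two extra choices (seven in all) weaken the elimination cue.
print(chance_all_remaining_correct(5, 7, n_known=4))
print(chance_all_remaining_correct(5, 7, n_known=3))
```

With as many choices as statements, knowing four answers makes the 
fifth certain. Extra choices lower these chances. 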


MAXIMS ON MATCHING ITEMS 

1. When Writing Matching Items, the Items in a Set Should Be 
Homogeneous. For example, they should all be names of persons, 
or all dates of events, or all provisions of different parts of the 
Constitution. 

2. The Number of Answer Choices Should Be Greater Than the 
Number of Problems Presented. This holds except when each answer 
choice may be used more than once, as in variations that we shall con- 
sider presently. 

3. The Set of Items Should Be Relatively Short. It is better to make 
several relatively short matching sets than one long one because (1) 
it is easier to keep the items in the set homogeneous and (2) it is 
easier for the student to find and record the answer. 

4. Response Options Should Be Arranged in a Logical Order, if 
One Exists. Arranging names in alphabetical order or dates in chrono- 
logical order reduces the clerical task for the examinee. 

5. The Directions Should Specify the Basis for Matching and Should 
Indicate Whether an Answer Choice May Be Used More Than Once. 
These precautions will guarantee a more uniform task for all examinees. 

A variation on the matching type of item which is sometimes effec- 
tive is the classification type or master list. This pattern, illustrated on 
p. 81, presents an efficient means of exploring range of mastery of a 
concept or related set of concepts. 

Another setting in which the master list variation of the classification 
type of item can often be used to advantage is that of testing 
knowledge of the general chronology or sequence of events. See Example II 
on p. 81. 

There are a number of other varieties of objective test items that 
have been developed and used to some extent. The reader who is 
interested in a survey of these, together with a more extended 
discussion of teacher-made tests in general, is referred to the suggested 
additional readings at the end of the chapter. 



Example I 

A if it is specifically permitted or required by the Constitution. 

B if it … 

(C) 1. … the President delivers … 

(B) 2. … 

(D) 3. Congress passes a law requiring … school until they reach … 

(and possibly others). 


Example II 

For each event on the left, pick the choice on the right that tells when 
the event took place. 

Events 

(E) 1. Women were granted suffrage. 

(D) 2. All persons born or naturalized in the U. S. were declared 
to be citizens. 

… for admitting … states was adopted. 

(C) 5. Freedom of religion, speech, and press were guaranteed. 

(and possibly others). 

Time Line 

A 

Declaration of Independence. 

B 

Adoption of the Constitution. 

C 

Civil War. 

D 

World War I. 

E 


TESTING FOR UNDERSTANDING 

… testing factual knowledge … especially of the objective type, tend to 
emphasize facts. Teachers tend to assume that if a student knows the 
factual material, then he also understands that material. Although 
there is a positive relationship between factual knowledge and 
understanding, the relationship is not perfect. It is true that in order 
for the student to understand a principle, he must have the relevant 
facts and basic skills. But there is no assurance that mere possession 
of the facts means that the student really understands the material. 

If students are to develop understandings, understandings must be 
taught and they must be evaluated. In the measurement of 
understanding, the situations or applications used in evaluation should be 
similar to, but not identical with, the examples used in class. If the 
same situations are used, the student may get the correct answer 
because he has memorized the example given in class, not because he 
understands the principle. 

Objective test items do not divide up into two clearly distinct groups, 
those that measure factual knowledge and those that measure 
understanding, application, or interpretation. Many items involve 
understanding and application at various levels as well as the underlying 
factual knowledge. Thus, illustration III on p. 72 and the matching 
item on p. 81 both call for applications of knowledge to new 
situations. Multiple-choice items in particular readily lend themselves to 
testing the understanding and application of principles with novel 
material or in novel settings. 

Another type of item is the interpretive type item. This type of 
item consists of an introductory selection of material, giving the 
necessary background and setting the problem, followed by a series of 
questions asking for interpretations of the material. The introductory 
material can be text, graphs, tables, maps, charts, or any similar 
material. It can be complete in itself, providing all the necessary 
information basic to the understanding, or it can be incomplete so that the 
student must know certain things in addition to those given. 

The eighth-grade unit on the Constitution that we have used so far 
does not provide for good examples of the interpretive type of 
exercise. However, two examples of the interpretive test exercise that 
were constructed for a twelfth-grade unit on labor unions are given. 
The first is based on a graph showing certain data about union 
membership, strikes, and important social and economic events. In this 
item, the accuracy of the student’s answer depends only upon his ability 
to understand the material as it is presented to him in graphic form. 

The second example is based on a newspaper item, and the student 








Example I 

[Figure: graph of union membership and workers on strike, with 
important social and economic events] 

(A) 8. The period between 1920 and 1930 was marked by a steady 
decrease in union membership. 

(A) 9. The establishment of the CIO was followed by an increase 
in union membership. 

(A) 10. The pattern of number of workers out on strike is similar 
to that of the number of workers belonging to unions. 


Example II 

Radio station WKRX uses only recorded music on its programs. A 
contract between the radio station and the musicians’ union had required the 
station to hire a certain number of musicians, even though the musicians 
never played on any programs. When the contract ended, the radio 
station refused to renew it. Members of the musicians' union started to 
picket the radio station headquarters to force it to renew the contract. 
When the baseball season started, members of the union began to picket 
the local baseball park because the ball games of the local team were 
broadcast over station WKRX. The owners of the ball team and of the 
radio station took the case to court and asked the court to rule whether 
the picketing was legal. 

What was the most probable ruling by the court? 

A. Only the picket line at the baseball park was legal. 

B. Only the picket line at the radio station was legal. 

C. Both picket lines were legal. 

D. Neither of the picket lines was legal. 

From the statements below check all that support your answer. 

  1. Workers cannot be prevented by management from using 
     any peaceful method of protecting their jobs. 

  2. The Taft-Hartley Act permits strikes when other means of 
     settling disputes fail. 

X 3. Secondary boycotts are forbidden by the Taft-Hartley Act. 

X 4. “Featherbedding” practices by unions are forbidden under 
     the Taft-Hartley Act. 

  5. Since the picketing of the baseball park was against the radio 
     station and not against the baseball team, the owners of the 
     baseball team had no grounds for court action. 

  6. Since baseball is a sport, not a business, a baseball park 
     cannot be used to force the settlement of a dispute between 
     labor and management. 

  7. Strikes cannot be called against an employer who does not 
     have a contract with a union. 

The interpretive type of item provides an opportunity to ask
meaningful questions about complex data in order to evaluate the
student’s ability to understand and interpret such materials.

However, this item type presents special problems. The introductory
material must be carefully chosen to elicit the type of understanding
that the teacher desires. Although a number of sources such as
newspapers, magazines, or books can be used to furnish the introductory
material, it usually has to be rewritten and adapted by the teacher
to keep it at an appropriate reading level and to eliminate unnecessary
parts. The success of this type of item is dependent to a large extent
upon the adequacy of the introductory material.

Another disadvantage of the interpretive type of item is the reading
load. Most of these items tend to be long, so that the evaluation of
understanding will be contaminated by the reading level of the student.

A third disadvantage is the amount of space required to present the 
item and the amount of time required to answer it. With this type of 
item it is not possible to get as many different units of coverage as with 
the usual type of multiple-choice item. 

For a more detailed discussion and for more examples of methods
of measuring understanding in the different subject-matter fields, the
reader is referred to the Forty-Fifth Yearbook of the National Society
for the Study of Education, listed in the supplementary readings at
the end of the chapter.


GETTING THE OBJECTIVE TEST READY FOR USE 

So far we have considered the problems involved in improving the
quality of the individual objective test items. Now we must give some
thought to putting the items together into a test that is an effective
whole. The quality of the total test will have been determined in large
measure by the quality of our initial planning and by the skill with
which we have written the separate test items. However, some further
suggestions may help in achieving a sound and workmanlike product.


EXTRA ITEMS 

When the items are originally written it will usually pay to write a
surplus over the number that will finally be used. Items that seem
masterworks in the first pride of authorship may show unsuspected
flaws when coldly re-examined at a later date. Furthermore, some
freedom for fitting the final test to the specifications of the blueprint
is often helpful. A surplus of 20 or 30 per cent is none too much.


REVIEW AND EDITING

It is always sound policy, if time permits, to write the items early
and put them aside for a while. When reread later, ambiguities will
appear that were not seen at all when the item was first written. Even
more helpful, if it is feasible, is to get another person who knows the
subject matter to go over the items, trying them and criticizing them.
This type of review will usually bring out a rather startling number
of points of ambiguity or disagreement. Revision of the items in the
light of such a critique or elimination of items that seem not
salvageable will do much to avoid those debates with students and
those ill-feelings that are an occasional feature of objective
examinations.


FORM OF REPRODUCTION

Though it is possible to give objective examinations orally, it is far
from satisfactory to do so. Oral administration is demanding upon
students’ concentration and introduces an element of speed pressure
that is quite disturbing to some. One generally assumes that an
objective test will be reproduced and that each pupil will have a copy.
Gelatin duplicating processes are adequate for groups of moderate size,
but most test makers will prefer to mimeograph the test if facilities
for mimeographing are available. More important than the process is the
quality of the work, both in organizing the layout of the test and in
typing up the master copy.

ORDER AND GROUPING OF TEST ITEMS

After the items have been edited and those to be included in the
test have finally been selected, they must be arranged in the order in
which they are to appear in the test. There are three aspects that
should be considered and reconciled as far as possible in deciding upon
the arrangement and grouping of items.

1. Items in the same format (true-false, multiple-choice, etc.)
should be grouped together, so that instructions for answering will
carry throughout the set.

2. In general, an attempt should be made to progress from easy to
more difficult items. This is especially important with younger
children, who may become discouraged and quit if the early items are too
difficult. It is also important if time is likely to be limited, so that
some items will not be reached. These not-attempted items should be the
more difficult ones that the examinee would not have been likely to
answer correctly even if he had reached them.

3. Items dealing with similar content can well be grouped together.
If this is done, it will help to reduce the feeling that the test is made
up of unrelated bits and pieces. It will encourage a more integrated
attack by the examinee.
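
One way of reconciling the three considerations above is a sort on
several keys at once. The following is only a sketch in Python, under
the assumption that each item has been labeled with its format, its
content area, and an estimated proportion of pupils expected to answer
it correctly. The class name, field names, and sample items are
hypothetical and are not taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    fmt: str          # item format, e.g. "true-false" or "multiple-choice"
    topic: str        # content area (hypothetical labels)
    p_correct: float  # estimated proportion answering correctly; higher = easier


def arrange(items, format_order):
    """Group by format, then by topic, then run from easy to difficult."""
    return sorted(
        items,
        key=lambda it: (format_order.index(it.fmt), it.topic, -it.p_correct),
    )


items = [
    Item("Q1", "multiple-choice", "labor", 0.40),
    Item("Q2", "true-false", "taxes", 0.90),
    Item("Q3", "multiple-choice", "labor", 0.85),
    Item("Q4", "true-false", "labor", 0.60),
]
ordered = arrange(items, ["true-false", "multiple-choice"])
order_labels = [it.text for it in ordered]
```

In this sketch format takes precedence over content, and content over
difficulty, so the easy-to-difficult progression holds only within each
content group. A test maker who puts more weight on difficulty would
change the order of the keys.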

DIRECTIONS

Clear instructions to the examinees are an important element in a
well-constructed test. Examinees will usually know the purpose of a
test, but if it is possible that they may not, the purpose should be
stated. Complete instructions should be given as to how the pupil
is to record his answers. This is particularly important for novel or
unusual item patterns. The examinee should be given explicit
information as to the scoring procedure that will be used, including the
credit for each item or part and whether or not a correction will be
made for guessing. (See Scoring, p. 88.)

Sample sets of directions for matching items have been given on
p. 81.

For a test made up of multiple-choice items that will not be cor- 
rected for guessing and for which separate answer sheets are used, one 
might use the following set of directions. 

Directions:

Read each item and decide which choice completes the statement
or answers the question.

Mark your answers on the separate answer sheet. Do not mark them
on the test booklet. Indicate your answer by blacking out on the answer
sheet the letter corresponding to your choice. That is, if you think that
choice B is the best answer to item 1, black out the B in the row after
No. 1 on your answer sheet.

Your score will be the number of right answers, so it will be to your
advantage to answer every question, even if you are not sure of the right
answer.

Be sure your name is on your answer sheet.

For a test made up of true-false questions in which answers are to 
be recorded on the test paper and the total score will be corrected for 
guessing, the following set of directions could be used. 

Directions:

Read each of the following statements carefully.

If all or any part of the statement is false, circle the F in front of the
statement.

If the statement is completely true, circle the T in front of the
statement.

Your score will be the number of right answers minus the number of
wrong answers, so do not guess blindly. If you are not reasonably sure of
an answer, omit the question.

Be sure your name is on your test. 


LAYOUT OF ITEMS 

The two points important to bear in mind when planning how the
items and answers will be placed on the sheet are (1) clarity and
convenience for the examinee and (2) convenience for the scorer. In
the interest of the person taking the test, items should not be crowded
together too closely. Multiple-choice items are easier to read if each
response option is on a separate line. Having part of an item on one
page and part on the next should be avoided if possible. If several
items all refer to a single diagram or chart, it is desirable that all of
them appear on the same page as the diagram or chart.

Course ______________          Name ______________
Exam   ______________          Date ______________

Instructions: Read the directions on the test sheet and follow
them exactly. For each test item, mark your choice for the
correct answer by blocking out the letter which corresponds
to the best answer for the test item.

Item   Answer         Item   Answer         Item   Answer
  1    A B C D E       26    A B C D E       51    A B C D E
  2    A B C D E       27    A B C D E       52    A B C D E
  3    A B C D E       28    A B C D E       53    A B C D E
  4    A B C D E       29    A B C D E       54    A B C D E

Fig. 4.2. Part of a home-made answer sheet.

The arrangement of answers should be such as to facilitate scoring. 
Even in the upper elementary school it is practical to put spaces for 
all the answers in a column on one side of the page. A scoring key 
can then be laid beside the answer column to speed up scoring. In 
the junior high school and above, a simple separate answer sheet may 
be used. Part of a home-made answer sheet which is adaptable for 
both true-false and multiple-choice items is shown in Fig. 4.2. 

In school-wide or city-wide testing projects, machine-scored answer
sheets of the type developed for standardized tests may be used, if
facilities for machine scoring are available.

SCORING 

Layout of answers to facilitate scoring has been discussed in the 
previous paragraphs. A scoring stencil that can be placed alongside 
the columns of answers or placed directly over a separate answer sheet 
will make scoring go very quickly. 

The test maker must decide how he is going to treat guessing in his
scoring procedure. As we have indicated, his decision should be
made known to the examinees. If time permits every student to attempt
every item, a score that is simply the number of right answers
is quite satisfactory. In this case, examinees should be firmly
instructed to guess, even if they have no idea of the answer. This
procedure has sometimes been criticized as poor pedagogy, since it
involves practice in errors. However, the student will think about each
item anyhow. It seems doubtful that the final step of marking an
answer, when one knows in one’s own mind that one is just guessing,
will have any very lasting impact on the impression one carries away
from the test.


If the test is speeded, so that pupils will attempt different numbers
of items, or if the test user wishes to discourage guessing on the part
of examinees, a penalty should be applied for wrong answers. The
usual correction formula, based on the assumption that the person who
does not know the answer will make a random guess, is

$$\text{Score} = R - \frac{W}{n - 1}$$

where R is the number of questions answered correctly;

W is the number of questions answered incorrectly;

n is the number of answer choices for an item.

For example, in a true-false test where there are only 2 possible
answers, n - 1 becomes 2 - 1, or 1, and the correction for guessing
is the number of right answers minus the number of wrong answers.
Thus, if there were 75 true-false items on a test and a student got 48
right, got 20 wrong, and did not answer 7 of them, his score would be
48 - 20 or 28. Note that omits do not count in this formula for
guessing.

For a second example, suppose a student took a 60-item multiple-choice
test in which each item had 5 possible answers. If he got 52
questions right and 8 wrong, his corrected score would be

$$52 - \frac{8}{5 - 1}$$

or 50.
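
The correction formula and both worked examples can be checked with a
few lines of Python. This is a minimal sketch; the function name
`corrected_score` is an assumption made for illustration, and exact
fractions are used so that a result such as 8/4 is not rounded.

```python
from fractions import Fraction


def corrected_score(right, wrong, n_choices):
    """Score = R - W / (n - 1); omitted items do not enter the formula."""
    if n_choices < 2:
        raise ValueError("an item needs at least two answer choices")
    return right - Fraction(wrong, n_choices - 1)


# True-false example from the text: 48 right, 20 wrong, 7 omitted.
true_false = corrected_score(48, 20, 2)

# Five-choice example from the text: 52 right, 8 wrong.
multiple_choice = corrected_score(52, 8, 5)
```

Here `true_false` equals 28 and `multiple_choice` equals 50, in
agreement with the two examples above.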



ANALYZING AND USING THE RESULTS OF 
OBJECTIVE TESTS 

Giving the test, scoring it, and recording a score for each pupil
frequently ends the matter as far as the teacher is concerned. However,
if the teacher drops the test at this point, he loses much of its
value. An analysis of the responses the pupils made to the items can
serve two important purposes. In the first place, the test results
provide a diagnostic technique for studying the learnings of the class
and the failures to learn and for guiding further teaching and study. In
the second place, the responses of pupils to the separate items and
a review of the items in the light of these responses provide a basis for
preparing better tests another year.

The basic analysis that is needed is a tabulation of the responses 
that have been made to each item on the test. We need to know how
many pupils got each item right, how many chose each of the possible 
wrong answers, and how many omitted the item. It helps our under- 
standing of the item if we have this information for the upper and 
lower fractions of the group, and perhaps also for those in the middle. 
From this type of tabulation, we can answer such questions as the 
following for each item: 


1. How hard is the item? 

2. Does it distinguish between the better and poorer students? 

3. Do all the options attract responses, or are there some that are 
so unattractive that they might as well not be included? 
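
The tabulation described above can be sketched in Python as follows.
This is an illustration only: the function name, the data layout (a
total score paired with a mapping from item label to chosen option),
and the eight sample papers are all hypothetical.

```python
from collections import Counter


def tabulate(papers, item, fraction=0.25):
    """Tally responses to one item in the upper and lower score groups.

    papers: list of (total_score, responses) pairs, where responses maps
    an item label to the option chosen, or to None for an omitted item.
    """
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of each extreme group
    upper, lower = ranked[:k], ranked[-k:]

    def count(group):
        return Counter(resp.get(item) or "Omit" for _, resp in group)

    return count(upper), count(lower)


# Hypothetical class of eight pupils and a single item labeled "q1".
papers = [
    (85, {"q1": "A"}), (70, {"q1": "A"}), (66, {"q1": "B"}), (59, {"q1": "A"}),
    (48, {"q1": "A"}), (40, {"q1": "C"}), (31, {"q1": None}), (14, {"q1": "B"}),
]
upper, lower = tabulate(papers, "q1")
```

With eight papers and a fraction of 25 per cent, each extreme group
holds two papers. Both pupils in the upper group chose A, and the lower
group shows one B and one omit.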

A simple form can be prepared for recording the responses to each 
item, like that shown in Fig. 4.3. This can be put on a separate card
for each item, and then the information can be accumulated in a per- 
manent item file. This form is planned for a multiple-choice item with 
as many as five choices but can be used for true-false items by using 
only the A and B columns. 


Item: Which one of the following states was formed from the Northwest
Territory?

A. Indiana
B. Iowa
C. Montana
D. Oregon

Option          A     B     C     D     E     Omit
Upper 25%
Middle 50%
Lower 25%

Fig. 4.3. Form for recording item-analysis data.




To illustrate the type of information that is provided by an item
analysis, we present below certain items from a social studies test,
together with the analysis of responses for each item. This test was
given in 1960 to 100 high-school seniors who had had a course in
current American problems. There were 95 items on the test. The highest
score on the test was 85 and the lowest score was 14. The test
papers were arranged in order of total score starting with the score of
85 and ending with the score of 14. The top 25 papers were selected
to represent the upper group (score range 59 to 85) and the last 25
papers were selected to represent the lower group (score range 14 to
34). The count of responses is based on the 25 cases from the top
and the 25 cases from the bottom of the group. The responses made
to each item by each individual in the upper and lower groups were
tallied to give the frequency of choosing each option. These
frequencies are shown on the right. The correct option is underlined.
Each item is followed by a brief discussion of the item data.

Item I

“Everyone’s switching to Breath of Spring Cigarettes!” is an example of
the propaganda technique called

                                   Upper   Lower
A. glittering generality.             0       2
_B._ bandwagon.                      25      20
C. testimonial.                       0       2
D. plain folk.                        0       1
(Omit)                                0       0


This is an easy item, since all 25 in the upper group and 20 in the
lower group get it right. However, it does differentiate in the desired
direction, since what errors there are fall in the lower group. The item
is also good in that all of the wrong answer choices are functioning;
i.e., each wrong answer has been chosen by one or more persons in
the lower group. Two or three easy items like this would be good
“ice-breakers” with which to start a test.

Item II

There were no federal income taxes before 1913 because prior to 1913

                                              Upper   Lower
A. the federal budget was balanced.              3       5
B. regular property taxes provided enough
   revenue to run the government.                9      15
_C._ a tax on income was unconstitutional.      13       0
D. the income of the average worker in the
   U. S. was too low to be taxed.                0       5
(Omit)                                           0       0

This was a difficult item but a very effective one. That it was
difficult is shown by the fact that only 13 out of 50 got it right. That
it was effective is shown by the fact that all 13 getting the item right
were in the upper group. All of the wrong options attracted some choices
in the lower group and all of the wrong options attracted more of the
lower group than the higher group. Incidentally, an item such as this
shows how faulty the idea of “blind guessing” often is when an item is
effectively written. In this item, the majority of the lower group
concentrated upon one particular wrong option that was particularly
plausible and appealing.


Item III

Under the “corrupt practices act” the national committee of a political
party would be permitted to accept a contribution of

                                              Upper   Lower
A. $10,000 from Mr. Jones.                      15       4
B. $1,000 from the ABC Hat Corporation.          4       6
_C._ $5,000 from the National Association
   of Manufacturers.                             2       8
D. $500 from union funds of a local labor
   union.                                        4       7
(Omit)                                           0       0

This item turned out poorly. Only 10 out of 50 got it right, and
right answers were more frequent in the lower than in the upper group.
As far as the test is concerned, it appears that this item would have to
be either discarded or radically revised. If the group was supposed to
have learned about the provisions of the “corrupt practices act,” this
shows clearly that the learning did not take place. In order to arrive
at the correct answer to the item the student would have to know (1)
the limit placed on contributions to the national committee of a
political party, (2) who is forbidden to make contributions, and (3) what
kind of organization the National Association of Manufacturers is.
The teacher would have to discuss the item with the class to determine
where the difficulty lies, but one might guess that it is points 1 and 3
that are causing difficulty in the upper group.



Item IV

The term “easy money” as used in economics means

                                              Upper   Lower
_A._ the ability to borrow money at low
   interest rates.                              21      17
B. dividends that are paid on common stocks.     0       0
C. money that is won in contests.                0       0
D. money paid for unemployment compensation.     4       8
(Omit)                                           0       0



This item shows some discrimination in the desired direction (21
versus 17), but the differentiation is not very sharp. The response
pattern is one that is quite common. Only two of the four choices
are functioning at all. Nobody selects either the B or C choices. If
we wished to use this item again, we might try substituting “wages
paid for easy work” for option B and “money given to people on
welfare” for option C. The repeat of the word “easy” in option B and
the idea of getting money for not working in option C might make the
item more difficult and more discriminating.
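
The three questions asked of each item (how hard it is, whether it
separates the better from the poorer students, and whether every option
draws responses) can be reduced to simple arithmetic on the tallies.
The sketch below is an illustration in Python and not a procedure given
in the text. It expresses difficulty as the proportion correct in the
two extreme groups combined and discrimination as the upper-group
proportion correct minus the lower-group proportion correct. The
function name is hypothetical, and the key for Item III is assumed to
be option C, the choice consistent with the discussion of that item.

```python
def item_stats(upper, lower, key, group_size=25):
    """Summarize one item from upper- and lower-group option tallies."""
    right_u, right_l = upper[key], lower[key]
    difficulty = (right_u + right_l) / (2 * group_size)   # proportion correct
    discrimination = (right_u - right_l) / group_size     # upper minus lower
    # Wrong options that nobody in either group selected.
    unused = [opt for opt in upper
              if opt != key and upper[opt] + lower[opt] == 0]
    return difficulty, discrimination, unused


# Tallies for Items I, III, and IV as printed (25 papers in each group).
item_1 = item_stats({"A": 0, "B": 25, "C": 0, "D": 0},
                    {"A": 2, "B": 20, "C": 2, "D": 1}, key="B")
item_3 = item_stats({"A": 15, "B": 4, "C": 2, "D": 4},
                    {"A": 4, "B": 6, "C": 8, "D": 7}, key="C")  # key assumed
item_4 = item_stats({"A": 21, "B": 0, "C": 0, "D": 4},
                    {"A": 17, "B": 0, "C": 0, "D": 8}, key="A")
```

On these figures Item I is easy (0.90 correct) with a small positive
discrimination, Item III is hard (0.20 correct) and discriminates in
the wrong direction, and Item IV returns options B and C as unused,
which matches the discussion of each item.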

Item statistics such as these can be used not only for evaluating the
items but to guide review and restudy of the material with a class.
The items that prove difficult for the class as a whole provide leads for
further exploration. Discussion of these items with the class should
throw light on the nature of the misunderstanding. The misunderstanding
may in some cases be cleared up by brief further discussion,
although in some cases a fuller review of the topic may be indicated.
It is desirable, if local policies permit, to let pupils have their answer
sheets and a copy of the test and to make the answer key available to
them, so that they can themselves use the test as a guide to review and
clarification of the points they missed. An examination should teach
as well as test.


SUMMARY STATEMENT 

The deficiencies of essay examinations have led to the preparation
of tests made up of objective short-answer questions. These questions
may be prepared in true-false, completion, multiple-choice, matching,
and many other forms. Experience of item writers has led to the
formulation of a number of “do’s” and “don’ts” to guide the preparation
of test items. These are considered in detail in this chapter.



Though there is an unfortunate tendency for writers of objective
items to concentrate on factual information, the ability to understand,
interpret, and apply can be tested by items that follow this pattern. In
the measurement of understanding it is often desirable to describe a
fairly complex problem situation or to present a fairly full set of data
and to organize a set of related questions about the problem or data.
Illustrations are provided.

It helps, in producing a good test, to prepare extra items and to
have the items edited and screened before using. Items should be
grouped so as to emphasize relationships and to provide a general
progression from easy to more difficult. Answer sheets and scoring
stencils facilitate scoring. The issue of correction for guessing should
be resolved in advance, and examinees should be told what procedure
will apply.

Test results can be analyzed with profit to guide (1) further teaching
and review and (2) the construction of additional tests in later years.


SUGGESTED ADDITIONAL READING 

Dressel, Paul L., and Lewis B. Mayhew, Science reasoning and
understanding, Dubuque, Iowa, William C. Brown, 1954.

Ebel, Robert L., Writing the test item, Chapter 7 in E. F. Lindquist,
Editor, Educational measurement, Washington, D. C., American Council on
Education, 1951.

Gerberich, J. Raymond, Specimen objective test items, New York,
Longmans, Green, 1956.

Micheels, William J., and M. Ray Karnes, Measuring educational
achievement, New York, McGraw-Hill, 1950.

National Society for the Study of Education, The measurement of
understanding, The Forty-Fifth Yearbook, Part I, Chicago, Illinois,
University of Chicago Press, 1946.

Odell, C. W., How to improve classroom testing, rev. ed., Dubuque, Iowa,
William C. Brown, 1958, chapters VII-XIII.

Traxler, Arthur E., Administering and scoring the objective test, Chapter
10 in E. F. Lindquist, Editor, Educational measurement, Washington,
D. C., American Council on Education, 1951.

Wood, Dorothy Adkins, Test construction, development and interpretation
of achievement tests, Columbus, Ohio, Charles E. Merrill, 1960.

QUESTIONS FOR DISCUSSION 

1. A high-school principal has a system of using a different type of
objective test item each month: one month it is true-false, the next
month multiple-choice, the next month completion, and so on. Each teacher
is expected to follow this uniform pattern. How would you evaluate this
procedure? Why?

2. What steps can a teacher take to avoid ambiguous items on an
objective test?

3. Under what conditions would it be important to correct scores on an
objective test for guessing?

4. Collect some examples of poor items you have seen on tests. Indicate
what is wrong with each item.

5. Construct four multiple-choice items designed to measure
understanding or application in some subject area in which you are
interested.

6. Prepare a short objective test for a small unit that you are teaching
or plan to teach. Indicate the objectives that you are trying to evaluate
with each item. (Use the blueprint from Question 3, p. 58 if one is
available.)

7. What are the arguments for and against returning major examination
papers to students?

8. A fourth-grade teacher has given a test in arithmetic. What analyses
of the results could the teacher make that would help guide (a) future
work for the class as a whole and (b) special assistance given to
individual pupils?

9. A college teacher has given an objective test to a large class, scored
the papers, and entered the scores in the class record book. What further
steps might the teacher take before returning the papers to the students?
Why?
Why? 



Chapter 5

Elementary Statistical
Concepts

INTRODUCTION

In its various forms, measurement results in classification, rankings,
or scores. Any attempt to describe, summarize, or compare results
for individuals or for groups calls for numerical treatment. The
branch of arithmetic and mathematics that deals with the analysis of
sets of scores for groups of individuals is known as statistics. Every
user of tests and measurement devices needs at least a consumer’s
understanding of the basic objectives and techniques of descriptive
statistics. This is a book on measurement, not a statistics textbook.
Discussion of statistics as such is limited to this one chapter. It
cannot be expected that study of it will make the reader an accomplished
statistician. This chapter points out to the novice some basic types of
questions that the statistician tries to answer, and introduces him to
the simplest tools used to answer them.

Suppose you have prepared tests in reading, arithmetic, and spell- 
ing and given them to the pupils in two sixth grades in your school. 
You have scored the papers and entered the names and scores on a 
record sheet for the two classes. Table 5.1 shows the way the record 


Table 5.1. Record Sheet for Sixth Grades at School X

                               Test Scores
Name              Reading   Arithmetic   Spelling
1. Carol A.          32          3          26
2. Mary B.           27         27          23
3. Ruby C.           31          9          29
4. Alice D.          36         15          27
5. Theresa E.        47         21          35
6. Ida F.            42         24          26
7. Vivian G.         22
8. Grace H.          50         42          32




Table 5.1. (Continued)


Name 


Test Scores 
Reading Arithmetic 


9. Opal 1. 

10. Ursula J.
11. Beatrice K.

12. Karen L. 

13. Susan M- 

14. Jane N. 

15. Dorothy O 

16. Frances P- 

17. Elizabeth Q.

18. Pearl R. 

19. Joan S. 

20. Nancy T. 

21. Judith U. 

22. Edith V. 

23. Louise W. 

24. Helen X. 

25. Martha Y. 

26. Doris Z.

27. James A. 

28. Albert B. 

29. Donald C. 

30. Peter D. 

31. Samuel E. 

32. George F. 

33. Roger G. 

34. Newton H.

35. Karl I. 

36. Isidore J. 

37. John K. 

38. Benjamin L. 

39. Theodore M. 

40. Michael N. 

41. Herman O. 

42. Charles P. 

43. Patrick Q- 

44. William R- 

45. Marlin S. 

46. Frank T. 

47. Ralph U. 

48. Thomas V. 

49. Henry W. 


20 

37 

25 

37 

28 

34 

31 
21 

55 
59 
44 

32 

56 
38 
38 
29 
24 
36 

36 
21 
27 

37 
46 

33 
17 
35 

30 
22 
43 

31 
50 

34 
30 
52 
40 

42 
17 

32 
38 
29 
36 

43 


10 

13 

20 

15 

19 

48 

41 

41 

40 

24 

24 

18 

12 

26 

12 

29 

16 

7 

29 
36 
10 

14 
18 
12 

30 
9 

15 

38 
20 
15 

39 
33 

6 

26 

20 

20 

29 

25 

19 


Spelling 


11 

29 
IS 
23 
25 

30 
22 

17 
23 
33 
29 

18 
39 
21 

29 
27 
22 

30 
25 
14 
16 
21 
32 
27 
17 
29 


33 

20 

30 

20 

19 

36 

26 

32 

11 

18 

22 

24 

27 

.33 

24 




sheet might look. Now, what sorts of questions might you ask these
data? That is, to what questions might you ask the data to provide
the answers? Before reading further, suppose you study the set of
scores and jot down on a piece of scrap paper the questions that come
to your mind in connection with these scores. See how many of the
question types you can anticipate.


A first, rather general question you might ask is: What is the
general pattern of the set of scores? How do they “run”? What do they
“look like”? How can we picture the set of reading scores, for
example, so that we can get an impression of the group as a whole? To
answer this question we will need to consider simple ways of
tabulating and graphing a set of scores.

A second type of question that will almost certainly arise is: What
is this group like, on the average? Have they done as well on the test
as other sixth-grade groups? Are they ready for the regular
sixth-grade instruction and materials? What is the typical level of
performance in the group? All these questions call for some single score
to represent the group as a whole, some measure of the middle of the
group. To answer this question we shall need to become acquainted
with statistics developed to represent the average or typical score.

Third, in order to describe your group you might feel a need to
describe the extent to which the scores spread out away from the average
value. Are all the children in the group about the same, so that the
same materials and procedures would be suitable for all? If not, how
widely do they spread out on a given test? How does this group
compare with other classes with respect to the spread of scores? This
calls for a study of measures of variability.

Fourth, you might ask how a particular individual stands on some
one test. Thus, you might want to know whether James A. had done
well or poorly on the arithmetic test, and if you decided that his score
was a good score you might want some way of saying just how good
it was. You might ask whether James A. did better in reading or in
arithmetic. To answer this question we need a common yardstick in
terms of which to express performance in two quite different areas.
Our need, then, is for some uniform way of expressing and interpreting
the performance of an individual. How does he stand, relative to his
group?

A fifth query is of this type: To what extent did those who excelled
in reading also excel in arithmetic? To what extent do these two
abilities go together in the same individuals? Is the individual who is ...

... these questions. There ... with respect to a set of data. ... ones
concern the drawing of general conclusions from data ... group. Thus,
one sample of fifty boys may have ... fifty girls from the same school
on a ... true of these particular groups. We would ... whether we can
safely conclude that the total population of ... sample was drawn would
surpass the total ... This is a problem of inference. ... statistical
inference make up the bulk of advanced statistical work, but ... them
here.


WAYS OF TABULATING AND PICTURING A SET OF SCORES

... sixth-grade pupils had they ... the column headed ... could be
rearranged so as to give us a clearer picture of how the pupils ... the
reading test.

The simplest rearrangement would be merely to arrange ... order from
highest to lowest. We would have something that looked like this:

... and lowest ... group falls somewhere in ...




roughly half the scores fall between 30 and 40. But this simple
rearrangement of scores still has too much detail for us to see the
general pattern clearly. It is also not a convenient form to use in
computing. We need to condense it into a more compact form.

PREPARING A FREQUENCY DISTRIBUTION

A further step in organizing the scores for presentation is to pre- 
pare what is termed a frequency distribution. This is a table showing 
how often each score occurred. Each score value is listed, and the 
number of times it occurred is shown. A portion of the frequency 
distribution for the reading scores is shown in Table 5.2. However, 


Table 5.2. Frequency Distribution of Reading Scores
(Ungrouped Data)

Test Score   Frequency
    59           1
    58           0
    57           0
    56           1
    55           0
    54           0
    53           0
    52           1
    51           0
    50           1
    ..          ..
    20           1
    19           0
    18           0
    17           2


Table 5.2 is still not a very good form for reporting our facts. The
table is too long and spread out. We have shown only part of it. The
whole table would take 43 lines. It would have a number of zero
entries. There would be marked ups and downs from one score to
the next.
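
An ungrouped frequency distribution of this kind can be produced with a
few lines of Python. The sketch below is illustrative only: the
function name is an assumption, and the short list of scores is
hypothetical, chosen to run from 17 to 59 like the reading scores. It
is not the class data of Table 5.1.

```python
from collections import Counter


def frequency_distribution(scores):
    """List every score value from highest to lowest with its frequency."""
    counts = Counter(scores)
    return [(value, counts[value])
            for value in range(max(scores), min(scores) - 1, -1)]


# Hypothetical scores spanning the same range as the reading test.
scores = [59, 56, 52, 50, 20, 17, 17]
table = frequency_distribution(scores)
```

Because every value from 59 down to 17 is listed, `table` has
59 - 17 + 1 = 43 rows, most of them with a frequency of zero, which is
the drawback noted above.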

In order to improve the form of presentation further, scores are ...
so that each grouping includes three points of score. When we do this,
our set of scores is ... in Table 5.3. This provides a fairly compact
... showing how many scores there are in each group ... Thus we have
eight scores in the interval 34-36. ... do not know how many of them
are ... We have lost this information in the grouping. We assume they
are evenly divided. In ... that any one score will ... assumption is a
sound one, so the ... compactness and convenience of presentation more
than make up for any slight inaccuracy introduced by this grouping.*

Table 5.3. Frequency Distribution of Reading Scores
(Grouped Data)



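In a practical situation, the question arises whether to group by 3's,
5's, 10's, or some other interval. A useful rule of thumb is to choose
the interval that will divide the total range of scores into roughly 15
groups.

* In some sets of scores certain values are more popular than others,
and special precautions are necessary in grouping.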





Thus, in our example the highest score was 59 and the lowest was
17. The range of scores is 59 - 17 = 42. Dividing 42 by 15 we
get 2.8. The nearest whole number is 3, and so we group
by 3's. In addition to the “rule of 15,” we also find that intervals of 5,
10, and multiples of 10 make convenient groupings. Since the purpose
of grouping scores is to make a convenient representation, factors of
convenience enter as a major consideration.

It should be noted that sometimes there is no need to group data
into broader categories. If the original scores cover a range of no
more than, say, 20 points, grouping may not be called for.

In practice, when we are tabulating a set of data, deciding on the
size of the score interval is the first step. Next we set up the score
intervals, as shown in the left-hand column of Table 5.3. Each
individual is then represented by a tally mark, as shown in the middle
column. (It is easier to keep track of the tallies if every fifth tally is
a diagonal line across the preceding four.) The column headed
Frequency is gotten by counting the number of tallies in each score
interval.
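The tabulating steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `choose_interval` and `grouped_frequencies` are invented here, the ten scores are hypothetical (they are not the 52 reading scores of our example), and the lowest interval is assumed to start at 16, as in Table 5.3.

```python
from collections import Counter

def choose_interval(scores, target_groups=15):
    """Rule of 15: divide the range by 15 and round to the nearest whole number."""
    score_range = max(scores) - min(scores)
    return max(1, round(score_range / target_groups))

def grouped_frequencies(scores, width, lowest_limit):
    """Tally scores into intervals of equal width.

    Returns {(low, high): frequency}, highest interval first.
    Assumes every score is at least `lowest_limit`.
    """
    counts = Counter((s - lowest_limit) // width for s in scores)
    table = {}
    for k in range(max(counts), -1, -1):
        low = lowest_limit + k * width
        table[(low, low + width - 1)] = counts.get(k, 0)
    return table

# Hypothetical scores, for illustration only.
scores = [59, 17, 36, 36, 35, 34, 27, 40, 22, 31]
width = choose_interval(scores)      # (59 - 17) / 15 = 2.8, nearest whole number 3
table = grouped_frequencies(scores, width, lowest_limit=16)

for (low, high), f in table.items():
    # One stroke per case stands in for the tally marks.
    print(f"{low}-{high}  {'|' * f:<5} {f}")
```

Intervals with no cases are kept in the table with a frequency of zero, so that the printed column still runs unbroken from the highest interval to the lowest.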


GRAPHIC REPRESENTATION

It is often helpful to translate the facts of Table 5.3 into a pic- 
torial representation. A common type of graphic representation, 
which is called a histogram, is shown in Fig. 5.1. This can be thought 
of, somewhat grimly, as “piling up the bodies.” The score intervals 


Fig. 5.1. Histogram of reading scores. (Horizontal axis: score interval.)

Fig. 5.2. Frequency polygon of reading scores. (Horizontal axis: score interval.)

Another way of picturing the same data is by preparing a frequency
polygon. This is shown in Fig. 5.2. Here we have plotted a point
at the mid-point of each of our score intervals. The height at which
we have plotted the point corresponds to the number of cases, or
frequency (f), in the interval. These points have been connected, and
the jagged line provides a somewhat different picture of the same set
of data illustrated in Fig. 5.1. Histogram and frequency polygon are
essentially interchangeable ways of showing the same facts.
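A rough sketch in Python may make the two pictures concrete. It uses the grouped frequencies of the reading scores (the same figures that appear in the cumulative table for the median) and prints a histogram turned on its side, one mark per case. The function names are our own, chosen for illustration, and the sideways layout is an assumption of convenience, not the layout of Fig. 5.1.

```python
# Grouped frequencies of the 52 reading scores, lowest interval first.
intervals = [(low, low + 2) for low in range(16, 59, 3)]   # 16-18, 19-21, ..., 58-60
frequencies = [2, 3, 3, 4, 5, 7, 8, 7, 3, 3, 2, 2, 1, 1, 1]

def histogram_rows(intervals, frequencies, mark="#"):
    """One text row per interval; the row length is the frequency."""
    return [f"{low}-{high} {mark * f}"
            for (low, high), f in zip(intervals, frequencies)]

def polygon_points(intervals, frequencies):
    """(mid-point, frequency) pairs: the points a frequency polygon connects."""
    return [((low + high) / 2, f)
            for (low, high), f in zip(intervals, frequencies)]

# Print the highest interval at the top, as in the tables.
for row in reversed(histogram_rows(intervals, frequencies)):
    print(row)

print(polygon_points(intervals, frequencies)[6])   # the 34-36 interval
```

Both functions read the same two lists, which is another way of seeing that the histogram and the frequency polygon carry the same information.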

MEASURES OF CENTRAL TENDENCY 

We often need a statistic to represent the typical, or average, or
middle score of a group of scores. A very simple way of identifying
the typical score is to pick out the score that occurs most frequently.
This is called the mode. If we examine the array of scores on p. 99,
we see that the score 36 occurs 4 times and is the mode for this set
of data. We can also note another fact. The score values 38, 37,
32, 31, and 27 each occur 3 times. If there were 1 less 36 and 1 more
27, for example, the mode would shift by 9 points. The mode is
sensitive to such minor changes in the data and is therefore a crude
and not very useful indicator of the typical score. In Table 5.3, where
we have the grouped frequency distribution, the modal interval is the
interval 34-36. This is as closely as we can identify the mode for
data presented in this way.

MEDIAN 

A much more useful way of representing the typical or average
score is to find the value on the score scale that separates the top
half of the group from the bottom half. This is called the median.
In our example, in which we have 52 cases, we want to separate the
top 26 from the bottom 26 pupils. The required value can be
estimated from the scores shown in Table 5.3. Starting with the lowest
score, we count up until we have the necessary 26 cases. The
“counting up” is best done in a systematic way, as shown in Table 5.4.

Table 5.4. Frequency Distribution and Cumulative Frequencies for
Reading Scores

Score                      Cumulative
Interval     Frequency     Frequency
58-60            1             52
55-57            1             51
52-54            1             50
49-51            2             49
46-48            2             47
43-45            3             45
40-42            3             42
37-39            7             39
34-36            8             32
31-33            7             24
28-30            5             17
25-27            4             12
22-24            3              8
19-21            3              5
16-18            2              2


Table 5.4 shows the cumulative frequencies as well as the frequency in
each interval. Each entry in the column labeled Cumulative
Frequency shows the total number having a score equal to or less than
the highest score in that interval. That is, there are 5 cases scoring
at or below 21, 8 scoring at or below 24, 12 scoring at or below 27,
and so forth. As indicated, we wish to identify the point below which
50 per cent of the cases fall. Since 50 per cent of 52 = 26, we must
identify the point below which 26 pupils fall.

We note that 24 cases fall below the interval 34-36. We need
to include 2 more cases. Note that there are 8 cases in the 34-36
interval. We require only 2/8 or 1/4 of these cases. Now how shall
we think of these cases as being spread out over the interval 34-36?
As we indicated earlier, we assume that they are evenly spread
through the interval. We must therefore go 1/4 of the way up from the
bottom of the interval toward the top.

At this point we must ask what we mean by a score of 34. In
the first place, let us note that although test scores go by jumps of
1 unit, i.e., take only whole-number values, we think of the underlying
ability as having a continuous distribution taking all values. Thus, we
do not get a score of 34.27, but only because our test does not
register that precisely. We assume that a score of 34 means
closer to 34 than to 33 or 35. That is, 34 will mean from 33½
to 34½. This definition is somewhat arbitrary but is rather generally
accepted in statistics. Thus, the interval 34-36 is really
to be thought of as extending from 33½ to 36½. Since we require
¼ of the cases in this interval, we must go ¼ × (36½ - 33½) = ¼ × 3
= 0.75 points above the value 33½, which is the
borderline between the 2 intervals. The median for this set of scores
is 33.5 + 0.75 = 34.25.

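To compute the median, then, the steps are as follows:

1. Calculate the number of cases that represents 50 per cent of the
total group. In our example 50 per cent of 52 is 26.

2. Count up through the cumulative frequencies to the last interval
whose cumulative frequency does not exceed this number (here the
interval 31-33, with 24 cases).

3. Find how many more cases are required from the next interval
in order to include this number (here 26 - 24 = 2 of the 8 cases in
the interval 34-36).

4. Carry out the following operation:

\[ \left( \frac{\text{Number of additional cases required}}{\text{Number of cases in the interval}} \right) \times (\text{Number of score points in the interval}) \]

In our example this is (2/8)(3) = 0.75.

5. Add this amount to the upper limit of the interval found in step 2,
i.e., to the borderline between the two intervals. We have
for our data 33.5 + 0.75 = 34.25. This score is the median, the
score below which 50 per cent of the cases fall.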


PERCENTILES 

The same procedure may be used to find the score below which
any other percentage of the group falls. These values are all called
percentiles. The median is the 50th percentile, i.e., the score below
which 50 per cent of individuals fall. If we want to find the 25th
percentile, we must find the score below which 25 per cent of the
cases fall. Twenty-five per cent of 52 is 13. Thirteen cases take us
through the interval 25-27, and include 1 of the 5 cases in the 28-30
interval. So the 25th percentile is computed to be 27.5 + (1/5)3 =
27.5 + 0.6 = 28.1. Other percentiles can be found in the same
way. Percentiles have many uses, especially in connection with test
norms and the interpretation of scores.
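The counting-up procedure used for the median and the other percentiles can be written out as a short Python function. This is a sketch under the same assumptions the text makes: scores are whole numbers, each interval runs half a point below and above its printed limits, and the cases within an interval are evenly spread. The name `grouped_percentile` is ours, not a standard library routine.

```python
# Grouped frequencies of the 52 reading scores, lowest interval first.
intervals = [(low, low + 2) for low in range(16, 59, 3)]   # 16-18, ..., 58-60
frequencies = [2, 3, 3, 4, 5, 7, 8, 7, 3, 3, 2, 2, 1, 1, 1]

def grouped_percentile(intervals, frequencies, pct):
    """Score below which `pct` per cent of the cases fall (grouped data)."""
    n = sum(frequencies)
    needed = pct / 100 * n        # number of cases to count up to
    below = 0                     # cumulative frequency below the current interval
    for (low, high), f in zip(intervals, frequencies):
        if f > 0 and below + f >= needed:
            lower_real_limit = low - 0.5
            width = high - low + 1
            # Go the required fraction of the way up through this interval.
            return lower_real_limit + (needed - below) / f * width
        below += f
    raise ValueError("pct must be between 0 and 100")

for pct in (25, 50, 75):
    print(pct, round(grouped_percentile(intervals, frequencies, pct), 2))
```

For the reading scores this reproduces the values worked out by hand: 28.1 for the 25th percentile and 34.25 for the median, with the 75th percentile falling at 39.5.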

ARITHMETIC MEAN 

Another frequently used statistic for representing the middle of a
group is the familiar “average” of everyday experience. Since the
statistician speaks of all measures of central tendency as averages, he
identifies this one as the arithmetic mean. This is simply the sum of
a series of scores divided by the number of scores. Thus, the
arithmetic mean of 4, 6, and 7 is

\[ \frac{4 + 6 + 7}{3} = 5.67 \]

In our example, we can add together the scores of all 52 individuals
in our group. This gives us 1798. Dividing by 52, we get 34.58 for
the “average” or arithmetic mean for this group.

Adding together all the scores and dividing by the number of cases 
is the straightforward way of computing the arithmetic mean. If the 
group is fairly small, and especially if an adding machine is available, 
it may be the best way. However, it can be rather laborious, espe- 
cially with a large group. More efficient computing procedures are 
available, based on the frequency distribution given in Table 5.3. 
These calculations are based on a type of “trial balance.” Picking a 
score interval that looks to be about in the middle of the group, we 
sum the plus and minus deviations from this starting place. An ad- 
justment based on the excess of plus or minus deviations and applied 



Table 5.5. Frequency Distribution of Reading Scores Showing Steps in
Calculating Arithmetic Mean and Standard Deviation

Score       Frequency
Interval        f          x'       fx'     f(x')²
58-60           1           8         8       64
55-57           1           7         7       49
52-54           1           6         6       36
49-51           2           5        10       50
46-48           2           4         8       32
43-45           3           3         9       27
40-42           3           2         6       12
37-39           7           1         7        7
                                    +61
34-36           8           0         0        0
31-33           7          -1        -7        7
28-30           5          -2       -10       20
25-27           4          -3       -12       36
22-24           3          -4       -12       48
19-21           3          -5       -15       75
16-18           2          -6       -12       72
                                    -68
Sum                                  -7      535


to this starting place gives the value for the mean. The application
of this procedure to the reading test data is shown in Table 5.5, and
the steps are outlined below.

1. Choose some interval for the arbitrary starting place or “origin.” In
this example the interval 34-36 has been chosen. Call this interval zero.
(Note: Any interval can be chosen, and the final result will be the same.
The particular interval chosen is purely a matter of convenience.)

2. Call the next higher interval +1, the one above that +2, etc.; call
the next lower -1, the one below that -2, etc. These are shown in the
column labeled x'. This column indicates the number of interval steps
each interval is above or below our chosen starting point.

3. For each row, multiply the number of cases (frequency) by the
number of steps (x') above or below the chosen origin. These products give
the values in the column headed fx'. Note the minus signs in the lower
half of the column. (Ignore the column headed f(x')² for now. It refers
to a later topic.)

4. Sum the values in the fx' column, taking account of the plus and
minus signs. (Mistakes will be avoided if the plus entries are summed
separately, the minus entries summed, and then the two part sums
combined to give the final total.)



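5. Sum the frequencies in the column headed f to get N, the number of
cases. Divide the sum of the fx' column by N, multiply the quotient by
the number of score points in the interval, and add the result to the
mid-point of the interval chosen as origin. (When the quotient is
negative, adding it becomes in effect subtraction.)

These operations can be expressed by the following formula.

\[ \text{Mean} = \left( \frac{\text{Sum of } fx'}{N} \right) (\text{Interval}) + \text{Arbitrary origin} \]

In our illustration the values become

\[ \text{Mean} = \left( \frac{-7}{52} \right)(3) + 35 = (-0.135)(3) + 35 = -0.40 + 35 = 34.60 \]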

Starting where we did, the minus deviations slightly overbalanced
the plus ones. There was an excess of 7 on the minus side. Our
starting point was a little too high. We had to shift it down 7/52 of an
interval, or 7/52 × 3 points of score, to find a true balance point. Since
the middle of our zero interval corresponded to a score of 35, we had
to move down 0.40 points below 35 to get the true balance point, the
correct arithmetic mean.

The value 34.60 that we got in this way is almost the same as the
34.58 that resulted from adding all the scores together and dividing
by the number of cases. The correspondence is usually not perfect,
due to slight inaccuracies involved in grouping our scores into classes
in the frequency distribution, but the values obtained by the two
methods will always agree closely. It makes no difference which
interval we use for our starting point. Barring mistakes in arithmetic,
we will always get identically the same result.
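The short method can be checked with a few lines of Python. The sketch below follows the steps above for the reading scores; `coded_mean` and its parameter names are our own labels, and the intervals are listed lowest first, so that the interval 34-36 is entry number 6 when counting from zero.

```python
# Grouped frequencies of the 52 reading scores, lowest interval first.
intervals = [(low, low + 2) for low in range(16, 59, 3)]   # 16-18, ..., 58-60
frequencies = [2, 3, 3, 4, 5, 7, 8, 7, 3, 3, 2, 2, 1, 1, 1]

def coded_mean(intervals, frequencies, origin_index):
    """Arithmetic mean from a grouped distribution, using an arbitrary origin."""
    width = intervals[0][1] - intervals[0][0] + 1          # score points per interval
    low, high = intervals[origin_index]
    origin_midpoint = (low + high) / 2                     # 35 for the interval 34-36
    n = sum(frequencies)
    # x' is the number of interval steps above (+) or below (-) the origin.
    sum_fx = sum(f * (k - origin_index) for k, f in enumerate(frequencies))
    return sum_fx / n * width + origin_midpoint

mean = coded_mean(intervals, frequencies, origin_index=6)  # origin at 34-36
print(round(mean, 2))

# A different starting interval gives the same result.
print(round(coded_mean(intervals, frequencies, origin_index=10), 2))
```

Carried to more decimal places the result is about 34.596, which rounds to the 34.60 obtained above. Trying several values of `origin_index` is a quick way to confirm that the choice of starting interval does not matter.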

The arithmetic mean and the median do not correspond exactly,
but usually they will not differ greatly. In this example, the values
are 34.60 and 34.25, respectively. The mean and median will differ
substantially only when the set of scores is very “skewed,” i.e., there
is a piling up of scores at one end and a long tail at the other.
Fig. 5.3 shows three distributions differing in amount and direction of

* A list of common statistical symbols and their meanings is given at the end
of the chapter. Reference to these definitions may help in reading the
remainder of the chapter.



Fig. 5.3. Frequency distributions differing in skewness. (Surviving panel label: Negatively skewed.)


skewness. The top figure is positively skewed, i.e., has a tail running
up into the high scores. We get a distribution like this for income
in the United States, since there are many people with small and
moderate incomes and only a few with very large incomes. The center
figure is negatively skewed. A distribution like this would result if a
class was given a very easy test, which resulted in a piling up of
perfect and near-perfect scores. The bottom figure is symmetrical and
is not skewed in either direction. Many physical and psychological
variables give such a symmetrical distribution. In the many
distributions that are approximately symmetrical either mean or median will
serve equally well to represent the average of the group, but with
skewed distributions the median generally seems preferable. It is less
affected by a few cases out in the long tail.

MEASURES OF VARIABILITY

When describing a set of scores, it is often significant to report how
variable the scores are, how much they spread out from high to low
scores. For example, two groups of children, both with a median age
of 10 years, would represent quite different educational situations if
one had a spread of ages from 9 to 11 while the other ranged from
6 to 14. A measure of this spread is an important statistic for
describing a group.

A very simple measure of variability is the range of scores in the
group. This is simply the difference between the highest and the
lowest score. In our reading test example it is 59 - 17 = 42.
However, the range depends only upon the 2 extreme cases in the total
group. This makes it very undependable, since it can be changed
a good bit by the addition or omission of a single extreme case.


SEMI-INTERQUARTILE RANGE

A better measure of variability is the range of scores that includes
a specified part of the total group, usually the middle 50 per cent.
The middle 50 per cent of the cases in a group are the cases lying
between the 25th and 75th percentiles. We can compute these two
percentiles, following the procedures outlined on pp. 105-106. For our
example, the 25th percentile was computed to be 28.1. If we
calculate the 75th percentile, we will find that it is 39.5. The distance
between them is 11.4 points of score.

The 25th and 75th percentiles are called quartiles, since they cut
off the bottom quarter and the top quarter of the group respectively.
The score distance between them is called the interquartile range. A
statistic that is often reported as a measure of variability is the
semi-interquartile range (Q). This is half of the interquartile range. It is
the average distance from the median to the 2 quartiles, i.e., it tells
how far the quartile points lie from the median, on the average. In
our example, the semi-interquartile range is

\[ Q = \frac{39.5 - 28.1}{2} = \frac{11.4}{2} = 5.7 \]

If the scores spread out twice as far, Q would be twice as great; if they
spread out only half as far, Q would be half as large. Two
distributions that have the same mean, same total number of cases, and same
general form, and that differ only in that one has a variability twice
as large as the other are shown in Fig. 5.4.

STANDARD DEVIATION

The semi-interquartile range belongs to the same family of statistics 
as the median. Its computation is based upon percentiles. There are 
also measures of variability that belong to the family of the arithmetic 



Fig. 5.4. Two distributions differing only in variability. (Surviving panel label: Small variability.)


mean and are based upon score deviations. Suppose we had 4 scores
which were 4, 5, 6, and 7 respectively. Adding these together and
dividing by the number of scores we get

\[ \frac{4 + 5 + 6 + 7}{4} = \frac{22}{4} = 5.5 \]

This gives us the arithmetic mean. But now we ask how widely these
scores spread out around that mean value. Suppose we find the
difference between each score and the mean, i.e., we subtract 5.5 from
each score. We then have -1.5, -0.5, 0.5, and 1.5. These
represent deviations of the scores from the mean. The bigger the
deviations, the more variable the set of scores. What we require is some
type of average of these deviations to give us an over-all measure of
variability.

If we simply sum the above 4 deviation values, we find that they
add up to zero. This is necessarily so. We defined our arithmetic
mean as the point around which the plus and minus deviations exactly
balance. We shall have to do something else. The procedure that
statisticians have devised for handling the plus and the minus signs
is to square all the deviations. (A minus times a minus is a plus.)
An average of these squared deviations is obtained by summing them
and dividing by the number of cases. To compensate for squaring
the individual deviations, the square root of this average value is
computed. The resulting statistic is called the standard deviation (SD
or s). It is the square root* of the average of the squared deviations
from the mean. For our little example of 4 cases, the calculations
are as follows:

\[ SD = \sqrt{\frac{(-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2}{4}} = \sqrt{\frac{2.25 + 0.25 + 0.25 + 2.25}{4}} = \sqrt{1.25} = 1.12 \]
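The same little calculation can be followed step by step in Python. The sketch below divides by the number of cases, N, exactly as the text does; the function name `standard_deviation` is our own choice for illustration.

```python
import math

def standard_deviation(scores):
    """Square root of the average squared deviation from the mean (divides by N)."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]      # these always sum to zero
    average_square = sum(d * d for d in deviations) / n
    return math.sqrt(average_square)

scores = [4, 5, 6, 7]
sd = standard_deviation(scores)
print(round(sd, 2))   # 1.12
```

Python's standard library offers `statistics.pstdev`, which also divides by N and should agree with this result; `statistics.stdev` divides by N - 1 and would give a somewhat larger value.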

STANDARD DEVIATION COMPUTED FROM FREQUENCY DISTRIBUTION

The standard deviation may also be computed from the grouped
frequency distribution. The necessary steps have been shown
in Table 5.5. Take special note of the column headed f(x')². Each
entry in this column represents the number of cases (f) multiplied
by the square of the deviation (x') of that score interval from the
arbitrary origin. The sum of the values in this column gives a sum
of squared deviations, but these deviations are around our arbitrary
origin and are expressed in interval units. Several adjustments are
necessary to express the deviations in score units and in terms of the
true arithmetic mean. The steps are outlined below.

1. Carry out the operations for computing the arithmetic mean, as described
on pp. 106-108.

2. In addition, prepare the column headed f(x')². Each entry in this column
is the frequency (f) times the square of the deviation value (x'). However,
this last column can be computed most simply by multiplying together the
entries in the two preceding columns, i.e., x' times fx'. Note that all the
signs in this column are positive, since a minus times a minus gives a plus.

The remaining steps are given below, each with its expression in symbolism
and the illustrative example.

3. Get the sum of the f(x')² column. (“The sum of” will be indicated by Σ.)

\[ \sum f(x')^2 = 535 \]

4. Divide this sum by the number of cases.

\[ \frac{\sum f(x')^2}{N} = \frac{535}{52} = 10.288 \]

5. Divide the sum of the fx' column by the number of cases.

\[ \frac{\sum fx'}{N} = \frac{-7}{52} = -0.135 \]

6. Square the value obtained in 5 above.

\[ \left( \frac{\sum fx'}{N} \right)^2 = (-0.135)^2 = 0.018 \]

7. Subtract the value in 6 from that in 4.

\[ \frac{\sum f(x')^2}{N} - \left( \frac{\sum fx'}{N} \right)^2 = 10.288 - 0.018 = 10.270 \]

8. Take the square root of the value in 7.

\[ \sqrt{\frac{\sum f(x')^2}{N} - \left( \frac{\sum fx'}{N} \right)^2} = \sqrt{10.270} = 3.20 \]

9. Multiply by the number of score points in each class interval. (We call
this width of interval i.)

\[ i \sqrt{\frac{\sum f(x')^2}{N} - \left( \frac{\sum fx'}{N} \right)^2} = 3(3.20) = 9.60 \]

Presenting all the steps in a single formula, we have

\[ SD = i \sqrt{\frac{\sum f(x')^2}{N} - \left( \frac{\sum fx'}{N} \right)^2} = 9.60 \]

* The steps for computing the square root are shown in Appendix I.
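The nine steps can be traced in a short Python sketch. It takes the frequencies of Table 5.5 (listed here lowest interval first), forms the fx' and f(x')² sums, and applies the adjustments in order. The name `coded_sd` is ours, and the choice of entry 6 (the interval 34-36) as origin simply repeats the choice made in the table.

```python
import math

# Frequencies of the 52 reading scores, lowest interval (16-18) first.
frequencies = [2, 3, 3, 4, 5, 7, 8, 7, 3, 3, 2, 2, 1, 1, 1]

def coded_sd(frequencies, origin_index, width):
    """Standard deviation from a grouped distribution, using an arbitrary origin."""
    n = sum(frequencies)
    steps = [k - origin_index for k in range(len(frequencies))]    # the x' column
    sum_fx = sum(f * x for f, x in zip(frequencies, steps))        # sum of fx'
    sum_fx2 = sum(f * x * x for f, x in zip(frequencies, steps))   # sum of f(x')^2
    # Steps 4 to 8: mean square minus squared mean, in interval units.
    sd_in_intervals = math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)
    # Step 9: convert from interval units to score points.
    return width * sd_in_intervals

sd = coded_sd(frequencies, origin_index=6, width=3)
print(round(sd, 2))
```

Carried through without rounding at the intermediate steps, the result is about 9.61; the 9.60 in the worked example comes from rounding the square root to 3.20 before multiplying by 3. As with the mean, any other `origin_index` gives the same answer.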


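INTERPRETING THE STANDARD DEVIATION

It is almost impossible to say in any simple terms what the standard
deviation is. Suppose that for some group we have a standard
deviation of a given size. Is this large or small? It depends upon the
units in which we are measuring, and “large” and “small” have
meaning only in comparison with some other group or some other test.

The standard deviation gets its most clear-cut meaning for one
particular type of distribution. This distribution is called
the “normal” distribution. It is defined by a particular mathematical
equation, but the curve is a symmetrical curve, piled up in the middle
and tapering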






out to relatively long tails on either end. An illustration of a typical
normal curve is shown in Fig. 5.5. This curve is the normal curve
that best fits the reading test data we have been using as an illustration.
It has the same mean, standard deviation, and total area (number of
cases) as the reading test data. The histogram of reading test scores
appears in light dotted lines, so one can see how closely the curve fits
the actual test scores.

For the normal curve, there is an exact mathematical relationship
between the standard deviation and the proportion of cases. The
same proportion of cases will always be found within the same
standard deviation limits. This relationship is shown in Table 5.6. Thus,
in any normal curve about two-thirds (68.2 per cent) of the cases
will fall between +1 and -1 standard deviation from the mean.
Approximately 95 per cent will fall between +2 and -2 standard

Table 5.6. Proportion of Cases Falling within Certain Specified
Standard Deviation Limits for a Normal Distribution

                                                     Per Cent
Limits within Which Cases Lie                        of Cases
Between the mean and either +1.0 SD or -1.0 SD         34.1
Between the mean and either +2.0 SD or -2.0 SD         47.7
Between the mean and either +3.0 SD or -3.0 SD         49.9
Between +1.0 and -1.0 SD                               68.2
Between +2.0 and -2.0 SD                               95.4
Between +3.0 and -3.0 SD                               99.8



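deviations from the mean. In a normal distribution, a score 1 standard
deviation above the mean surpasses about 84 per cent of the group:
the 50 per cent who fall below the mean and the 34 per cent who fall
between the mean and +1 standard deviation.

This unvarying relationship of the standard deviation unit to the
arrangement of scores in the normal distribution gives the standard
deviation a type of standard meaning. It becomes a yardstick in
terms of which different groups may be compared or the status of a
given individual may be evaluated. Although the relationship of the
standard deviation unit to the score distribution does not hold exactly
in distributions other than the normal distribution, frequently the
distribution of test scores approaches the normal curve closely enough
that the standard deviation continues to have very nearly the same
meaning.

The distance of a score above or below the mean may be expressed
in standard deviation units and translated into the per cent of those
in the group whom the individual surpasses. A number of values for
this relationship are shown in Table 5.7. This table provides a basis
for interpreting any score expressed in standard deviation units.
Consider the set of scores for which we computed the mean and
standard deviation to be 34.6 and 9.6. Suppose a pupil gets a score
of 40. Since the mean is 34.6, he falls 5.4 points above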

Table 5.7

                     Per Cent
                     Having Scores
Deviation Value      below This Value
    +3.0                 99.9
    +2.5                 99.4
    +2.0                 97.7
    +1.5                 93.3
    +1.0                 84.1
    +0.5                 69.1
     0.0                 50.0
    -0.5                 30.9
    -1.0                 15.9
    -1.5                  6.7
    -2.0                  2.3
    -2.5                  0.6
    -3.0                  0.1




the mean of the group. The 5.4 points by which he surpasses the
mean is equal to 5.4/9.6 = 0.56 standard deviations. He is 0.56
standard deviations above the mean. We might expect him to surpass
approximately 71 per cent of the cases in our group. (An actual
count shows that this score is better than 39/52, or 75 per cent, of the
scores in our set of data.) A score expressed in standard deviation
units has much the same meaning from one set of scores to another,
and these units are directly comparable from one measure to another.
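The normal-curve figures used in this section can be reproduced, approximately, with Python's standard library. The sketch below is an illustration only: the function names are ours, and the proportion below a given deviation value is obtained from the error function `math.erf`, which is one common way of evaluating the normal curve.

```python
import math

def normal_cdf(z):
    """Proportion of a normal distribution falling below z standard deviations."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def standard_score(score, mean, sd):
    """Distance of a score from the mean, in standard deviation units."""
    return (score - mean) / sd

# The pupil with a reading score of 40 (mean 34.6, standard deviation 9.6).
z = standard_score(40, mean=34.6, sd=9.6)
print(round(z, 2), "SD above the mean;",
      round(100 * normal_cdf(z)), "per cent expected to fall below")

# Per cent of cases within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    within = normal_cdf(k) - normal_cdf(-k)
    print(k, round(100 * within, 1))
```

The computed values may differ from the printed tables in the last digit, since the tables were rounded at an earlier stage.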

In summary, the statistics most used for describing the variability
of a set of scores are the semi-interquartile range and the standard
deviation. The semi-interquartile range is based upon percentiles,
i.e., the 25th and 75th percentiles, and is commonly used when the
median is being used as a measure of the middle of the group. The
standard deviation is a measure of variability that goes with the
arithmetic mean. It is useful in the field of tests and measurements
primarily as providing a standard unit of measure having comparable
meaning from one test to another.


INTERPRETING THE SCORE OF AN INDIVIDUAL 

The problems of interpreting the score for an individual will be
treated more fully in Chapter 6, when we turn to test norms and units
of measure. It will suffice now to indicate that the two sorts of
measures we have just been considering, i.e., percentiles and standard
deviation units, each give us a framework in which we can view the
performance of a specific person. Thus, referring to the example
we worked out, if a new boy in the class got a score of 40 on the
reading test we could say either

a. That he surpassed 75 per cent of the group, i.e., that he fell at
the 75th percentile, or

b. That he fell 0.56 standard deviations above the mean.

Either statement gives his score meaning in relation to his group; he
is somewhat above average but not one of the best ones in the group.
Since they are based on the same score, they are two ways of saying
the same thing. Each has certain advantages, which we will examine
more carefully in Chapter 6.


MEASURES OF RELATIONSHIP 

We look now for a statistic to express the relationship between two
sets of scores. Thus, in our illustration we have a reading score and
an arithmetic score for each pupil. We may ask whether those
who did well in arithmetic also did well in reading. In this
case, we have two scores for each pupil. We can picture these
scores by a plot in two dimensions, as in Fig. 5.6. The
first person in our group has a score on the reading test of
32 and a score on the arithmetic test. Her scores are represented
by a single dot, placed according to her reading score on one scale
and her arithmetic score on the horizontal, or arithmetic, scale. There
is a dot to represent each pupil. A child who does well on both tests
will have his scores represented by a dot at the upper right
of our picture. A child who does poorly on both tests will fall at the
lower left. Where good score on one goes with poor score on
the other, we will find the points falling in the other corners, i.e., upper
left and lower right. In our data there is some tendency
for the scores to run in the lower-left to upper-right direction,
i.e., from low-low to high-high. But there are many exceptions. The
relationship is far from perfect; it is a matter of degree. We need
some statistical index to express this degree of relationship.
As an index of the degree of relationship, a statistic known as the





correlation coefficient can be computed. (The symbol r is used to
designate this coefficient.) This coefficient can take values ranging
from +1 through zero to -1. A correlation of +1 signifies that the
person who had the highest score on one test also had the highest score
on the other, the next highest on one was the next highest on the
other, and so on, exactly in parallel through the whole group. A
correlation of -1 means that the scores go in exactly the reverse
direction, i.e., the person highest on one is lowest on the other, next
highest on one is next lowest on the other, etc. A zero correlation
represents a complete lack of relationship. In-between values of r
represent tendencies for relationship to exist but with many discrepancies.
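A small Python sketch may help fix the meaning of these values. It assumes the usual product-moment definition of r, which is not spelled out at this point in the text (the computing procedures are left to the Appendix), and it uses five hypothetical pairs of scores invented purely for illustration.

```python
import math

def correlation(xs, ys):
    """Product-moment correlation between two sets of paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical scores for five pupils (not data from the text).
reading = [20, 25, 30, 35, 40]
same_order = [11, 13, 15, 17, 19]       # rises exactly in step with reading
reverse_order = same_order[::-1]        # highest on one, lowest on the other
mixed_order = [15, 11, 19, 13, 17]      # some tendency, many discrepancies

print(round(correlation(reading, same_order), 2))     # 1.0
print(round(correlation(reading, reverse_order), 2))  # -1.0
print(round(correlation(reading, mixed_order), 2))    # 0.3
```

With this definition a value of exactly +1 or -1 requires that the two sets of scores rise or fall in strict proportion, as they do in the first two hypothetical sets.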

Figure 5.7 illustrates four different levels of relationship. In box
A the correlation is zero, and the points scatter out in a pattern that
is just about round. All combinations are found: high-high, low-low,
high-low, and low-high. Box B corresponds to a correlation of +.30.



Fig. 5.7. (Four scatter plots, boxes A through D, illustrating different levels of correlation; caption illegible in source.)

You can see a barely perceptible tendency for the points to group in the
low-low and high-high direction. The tendency is more marked in
box C, which represents a correlation of +.60. In box D, which
represents a still higher correlation, the trend is quite marked. But
even with a correlation as high as this the points spread out a good
bit and do not follow an exact line. We may note in passing that
[illegible] correspond to a correlation coefficient of +[illegible].
The procedures for computing the correlation coefficient are outlined
in the Appendix for those readers who wish to carry out the
calculations.

There are two important situations in which correlation coefficients
will be encountered in connection with tests and measurements. The
first situation is one in which we are trying to determine how precise
and consistent a measurement procedure is. Thus, if we wanted
to know how consistent a measure of speed we got from a 50-yard
dash, we could have each child run the distance twice, perhaps on
successive days. The correlation between the two sets of scores would
give information on the precision of this measure of running
speed. The second situation is one in which we are studying the
relationship between two different measures, often in order to evaluate
one as a predictor of the other. Thus, we might want to study a
scholastic aptitude test as a predictor of college grades. The correlation
between test scores and grades would bear upon this problem.

In each case, we face the problem of evaluating the correlation
we obtain. Suppose the two runnings of the dash give a
correlation of .80. Suppose the aptitude test
correlates .60 with college grades. Should we be pleased or discouraged?
The answer lies in part in the plots of Fig. 5.7. Clearly, the higher
the correlation, the more closely one variable goes with the other. If
we think of discrepancies away from the diagonal line running from
low-low to high-high as errors, then the higher the correlation,
the smaller these errors. But even with fairly high correlations,
performance often differs a good deal from the diagonal.

However, everything is relative, and any given correlation
coefficient must be interpreted in comparison with the values that are
commonly



obtained. Table 5.8 contains a number of correlation coefficients that
have been reported for different types of variables. The nature of the
scores being correlated is described and the coefficient is reported. An
examination of this table will provide some initial background for
interpreting correlation coefficients. The coefficient will gradually take
on added meaning as the reader encounters coefficients of different
sizes in his reading about and work with tests.


Table 5.8. Correlation Coefficients for Selected Variables

                                                  Correlation
Variable                                          Coefficient
Height of identical twins                             .95
Intelligence of identical twins                       .88
Height versus weight                                  .60
Intelligence of siblings                              .53
Height of siblings                                    .50
Strength of grip and speed of running                 .16
Height versus Binet IQ                                .06
Height versus educational achievement                 .01
Shape of head versus intelligence                     .01
Height versus sociability                             .00
No. of physical defects among boys
  versus school progress                             -.29

SUMMARY STATEMENT 

We opened this chapter by pointing out the various kinds of
questions we might wish to answer by referring to a set of test scores.
Let us look at these questions again and see what answers we have
offered for them.

1. How Do Our Scores “Run”; What Do They “Look Like”? To
answer this question, we can arrange our scores into a frequency
distribution (Table 5.4) or plot them in a histogram (Fig. 5.1).

2. What Score Is Typical of the Group; Represents the Middle of
the Group? To represent the middle of the group we may calculate
the median, the 50th percentile (pp. 105-106), or the arithmetic
mean, the common average (pp. 106-108).

3. How Widely Spread Out Are the Scores; How Much Do They
Scatter? To represent the spread of scores statisticians have
developed (1) the semi-interquartile range, half the distance between the
25th and 75th percentile (p. 110), and (2) the standard deviation




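(pp. 111-113), a type of average of the deviations of the scores away
from the average.

4. How Can We Express the Score of an Individual? Anticipating the
fuller discussion in Chapter 6, we have considered the percentile rank,
the per cent of the group that the individual surpasses, and the score
expressed in standard deviation units above or below the mean.

5. To What Extent Are the Same Individuals High on Two Different
Measures? A numerical index of relationship is given by the
correlation coefficient, an index of “going-togetherness.” It is important
as describing the consistency of a measurement procedure as well as
describing the accuracy with which a test predicts some other measure,
such as school grades or job success.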

STATISTICAL SYMBOLS

The student who reads books dealing with tests, or articles about
testing in professional journals, will encounter a number of
conventional symbols used to refer to statistical concepts or operations.
Some of the most common are defined below. These definitions
should help in reading later chapters of this book, as well as
outside references.

Symbol         Definition
N              The total number of cases in the group.
f              Frequency. The number of cases with a specific score or
               in a particular class interval.
X              A raw score on some measure.
X̄ or M         The mean of the group.
Md             The median of the group. 50th percentile.
Q1             The lower quartile. 25th percentile.
Q3             The upper quartile. 75th percentile.
A subscript    Identifies the person to whom a value belongs, e.g.,
               the raw score of person 1.
SD or s        Standard deviation of a set of scores.
σ              Standard deviation in the population, though sometimes
               used to refer to the particular sample.
p              Per cent of persons getting a test item correct.
q              Per cent of persons getting a test item wrong (p + q = 1).
r              A coefficient of correlation.
[illegible]    A reliability coefficient. The correlation between
               equivalent test forms or two administrations of a test.
Σ              “Take the sum of.”


SUGGESTED ADDITIONAL READING 

Garrett, Henry E., Elementary statistics, New York, Longmans, Green,
1956.

Guilford, J. P., Fundamental statistics in psychology and education, 3rd
ed., New York, McGraw-Hill, 1956.

Nelson, M. J., E. C. Denny, and A. P. Coladarci, Statistics for teachers,
New York, Holt, 1956.

Walker, Helen M., and Joseph Lev, Elementary statistical methods, rev.
ed., New York, Holt, 1958.


QUESTIONS FOR DISCUSSION 

1. For each of the sets of scores indicated below, select what appears 
to you to be the most suitable class interval, and set up a form for tallying
the scores: 


Test                      No. of Cases    Range of Scores

Arithmetic                     84            8 to 53
Reading Comprehension          57           15 to 75
Interest Inventory            563           68 to 224


2. In each of the following distributions, indicate (a) the size of the
class interval, (b) the mid-point of the intervals shown, and (c) the real
limits of the intervals (i.e., the dividing lines between them).

(1)  4-7        (2) 17-19        (3) 50-59
     8-11           20-22            60-69
    12-15           23-25            70-79


3. Using the spelling scores given in Table 5.1 on p. 96, make a
frequency distribution and a histogram. Compute the median and the upper
and lower quartiles. Compute the arithmetic mean and standard deviation.

4. In the Bureau of Census reports the median is used in reporting
average income. Why is it used, rather than the arithmetic mean?




5. A 50-item vocabulary test [...] from 18 to 50. [...]
distribution of scores [...] measure of central tendency would be
[...] What measure of variability would you probably [...]

6. A high-school teacher gave two sections of a history class the same
test. Results were as follows:


                      Section A    Section B

Median                  64.6         64.3
Mean                    65.0         63.2
75th percentile         69.0         70.0
25th percentile         61.0         54.0
Standard deviation       6.0         10.5




7. [...] standard deviations above or [...]

John, Oscar [...] 31, 84 [...] Alice [...] Willard 56 [...]

8. If the distribution [...] approximately normal, about what per cent
of the group would each of [...]

9. Explain the meaning of each [...] following correlation [...]

a. The correlation between scores on a [...] intelligence test [...]

b. Ratings of pupils [...] “citizenship” and on [...] achievement test
is [...]

c. Ratings of [...] .02.



Chapter 6 

Norms and Units for
Measurement


THE NATURE OF A SCORE 

Johnny got a score of 15 on his spelling test. What does that mean, 
and how should we interpret it?

Actually, as it stands it has no meaning at all and is completely 
uninterpretable. At the most superficial level, we don’t even know 
whether this represents a perfect score, i.e., 15 out of 15, or a very 
low per cent of the possible, i.e., 15 out of 50. But even supposing 
we do know that it is 15 out of 20, or 75 per cent, what then? 

Look at Table 6.1. This shows two 20-word spelling tests. A 
score of 15 would have vastly different meaning if it were on test A 
than on test B. A person who got only 15 right on test A would not 
be outstanding in a second- or third-grade class. Try test B out on 
some friends or classmates. You will probably not find many of them 
who can spell 15 of these words correctly. When this test was given 
to a class of graduate students, only 22 per cent of them spelled 15 
of the words correctly. A score of 15 on test B is a good score among 
graduate students of education. 

As it stands, then, a score of 15 words right, or even of 75 per cent 
of the words right, can have no meaning or significance. It gets mean- 
ing only as we have some standard with which to compare it. 

In the usual classroom test, the standard operates indirectly and
imperfectly, partly through the teacher’s choice of tasks to make up
the test and partly through his standards for evaluating the responses.
Thus, the teacher picks tasks to make up the test that he considers
to be appropriate to represent the learnings of his group. No teacher
in his right mind would give test A to a high-school group or test B
to third graders. Where the responses vary in quality, as in essay
examinations, the teacher sets a standard for grading that corresponds
to what he considers it reasonable to expect from a group like his.



Table 6.1. Two 20-Word Spelling Tests

Test A      Test B

bar         baroque
cat         catarrh
form        formaldehyde
jar         jardiniere
nap         naphtha
dish        discernible
fat         fatiguing
sack        sacrilegious
rich        ricochet
sit         citrus
feet        feasible
act         accommodation
rate        inaugurate
inch        insignia
rent        deterrent
lip         eucalyptus
air         questionnaire
rim         rhythm
must        ignoramus
red         accrued


We need some broader, psychological [...] of reference if we are
[...] measurements. Let us take a look [...] Suppose, now, that we
[...] tests A and B [...] a 40-word test and [...] give that test to
pupils in each [...] from second through twelfth. What would we [...]
would soon [...] second or third grade [...] everybody would get [...]
gain in spelling ability [...] score of 20 [...]




to improve from 20 up to a score of 30 represents quite a respectable 
accomplishment. The two 10-point gains don’t begin to be equal. 
The units on our scale of scores cannot be considered equal units, then. 
We have a rubber yardstick that has been stretched out at some points 
and squeezed in at others. 

There is one further point that we should make about our spelling
scores. Let us consider test B, since the point will be most clearly
and obviously true in this case. A person who fails to get any of the
items right on test B cannot be said to fall at an absolute zero of
spelling ability. Actually, he may be able to spell hundreds, possibly
thousands, of words. So a person who gets 10 words right on test B
doesn’t demonstrate twice as much spelling ability as a person who gets
only 5 right. On this test, as in an iceberg, the great bulk of what we
are examining lies below “sea level” and can’t be seen. We cannot
guarantee that even test A gets down to a true zero point. In fact, it
would be hard to say what a real zero point is in spelling ability.

THE NEED FOR NORMS 

We must look, then, for some better type of unit in which to express 
test results than a raw count of units of score or a crude percentage 
of the possible score. We would like the units to have these properties: 

1. Uniform meaning from test to test, so that a basis of comparison
is provided through which we may compare different tests, e.g.,
different reading tests, a reading test with an arithmetic test, or an
achievement test with a scholastic aptitude test.

2. Units of uniform size, so that a gain of 10 points on one part of
the scale signifies the same thing as a gain of 10 points on any other
part of the scale.

3. A true zero point of “just none of the quality in question,” so that
we can legitimately think of “twice as much as” or “two-thirds as
much as.”

The different types of norms that have been developed for tests
represent marked progress toward the first two of the above objectives.
The third can probably never be reached for the traits with
which psychological and educational measurement is concerned. We
can put five 1-pound prints of butter on one side of a pair of scales,
and they will balance the contents of a 5-pound bag of flour poured
into the other. “No weight” is truly “no weight,” and units of weight
can be added together. But we don’t have that type of zero point or
that way of adding together in the case of educational and
psychological [...]





The average height can be determined in the same way for 9-year-olds,
10-year-olds, and each other age group. The values will fall on
some such curve as that shown in Fig. 6.1. Points for the curve will
ordinarily be computed only for full-year groups, but the curve is to
be considered continuous. That is, we can estimate points in between
the year groups by referring to the continuous curve. Thus, in Fig. 6.1
a height of 60 inches corresponds to (or is average for) the age 12,
while 50 inches corresponds to about 7 years and 8 months.
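Reading an age equivalent off the curve amounts to interpolating between the yearly averages. The Python sketch below shows the idea. The yearly averages in `AVERAGE_HEIGHT_BY_AGE` are hypothetical values, chosen only so that they agree with the two readings quoted above (60 inches at age 12, 50 inches at about 7 years 8 months); they are not the data behind Fig. 6.1. The straight-line interpolation between year groups and the function names are likewise assumptions made for illustration.

```python
# Hypothetical average heights (inches) of girls at full-year ages.
# These are NOT the values plotted in Fig. 6.1; they are illustrative only.
AVERAGE_HEIGHT_BY_AGE = [
    (7, 48.5), (8, 50.75), (9, 52.9), (10, 55.2), (11, 57.6), (12, 60.0),
]


def height_age(height):
    """Return the age (in years) for which `height` is the average height.

    Interpolates linearly between adjacent year groups (an assumption).
    Heights outside the tabled range raise ValueError, because no age
    equivalent can be read without extending the curve.
    """
    pairs = zip(AVERAGE_HEIGHT_BY_AGE, AVERAGE_HEIGHT_BY_AGE[1:])
    for (age_lo, h_lo), (age_hi, h_hi) in pairs:
        if h_lo <= height <= h_hi:
            fraction = (height - h_lo) / (h_hi - h_lo)
            return age_lo + fraction * (age_hi - age_lo)
    raise ValueError("height lies outside the range of the norm table")


def years_and_months(age):
    """Convert a decimal age to (whole years, months)."""
    return divmod(round(age * 12), 12)


print(years_and_months(height_age(60)))
print(years_and_months(height_age(50)))
```

A height above the last tabled average raises an error rather than returning an age, since any age value beyond the table would have to come from an invented extension of the curve.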

We can refer any height measurement to this scale and find for
what age it would be average. Each girl’s height can be interpreted
as being the average height for a girl of a particular age. Thus, the
girl who has a height of 60 inches can be described as being as tall
as the average girl of 12 years. If we also know how old the girl
actually is we can judge whether she is tall, average, or short for her
age. Thus if Mary is 55 inches tall and is only 8 years old we know
that she is tall for her age. Her height is average for [...]

The age framework is a relatively simple and familiar one. “He
[...]-year-old” is a common way of describing a youngster.
For a trait that shows continuous and relatively [...] period of
years, the age framework [...] have a [...] must now consider in
more [...]

The big issue in using age [...] of a year’s growth [...] the growth
from age 5 to age 6 [...] age 11, and similarly for [...] up [...]
age scale, we soon reach a [...] see that the year’s growth unit is
clearly inappropriate. [...] point, some time in the teens or early
20’s, when growth [...] slows down and [...] slowdown takes place
quite abruptly after age 14. A year’s growth after 14 seems clearly
to be much less than a [...] about 14 or 15, the concept of [...]
meaning. The same problem of a [...] found, varying only in the age
at which it occurs and in [...] for any trait that [...] measure. The
failure of the unit [...] uniform meaning is most apparent as one
[...] extremes of age, but there is no guarantee that this unit [...]
meaning even in the intermediate range.

The problem introduced by the flattening growth curve is most
apparent when we consider the [...] What is the height-age of [...]
10 (70 inches)? [...] average woman never gets that [...] are to
assign [...] age value we must invent some [...] extension of our
growth curve such as the lightly dotted line in [...] assumes that
growth continues at about [...] same [...] was typical up to age 14.
On this extrapolated curve, [...] a height-age of about 16 years, 6
months. But this is a completely artificial and arbitrary age [...]
correspond to the average height of [...] correspond to average
height at any age. It merely [...] “taller than average” [...] This
same type of artificial age equivalent must be used [...] to express
the performance of bright pupils in [...] and gradually disappear
[...] may be no more than that from age 11 to [...] be little or no
further rise. Thus, a mental age of [...] does not mean performance
corresponding to that of an average [...] 20-year-old. It is an
arbitrary extension of the [...]

Such arbitrary and artificial age values are required if we are to be
able to describe the performance of the upper half of our teen-age
[...]
It is also true that growth curves are not entirely [...] different
functions. Rate of growth and time of reaching a maximum
differ substantially. How shall we compare age scores on a vocabulary
test and a maze-tracing test, for example, if the first [...]
rise up to and into the twenties, while the second reaches [...]
in the early teens? For a 10-year-old to have reached the [...]
level may represent appreciably different degrees of superiority in
different traits.

Two years’ acceleration may also have quite different meaning,
depending on the age level at which it occurs. A 5-year-old who is as
tall as the 7-year norm is much more outstanding than the 10-year-old
who reaches the 12-year norm. This fact has led to the development
of the intelligence quotient and other types of quotients (which we
shall consider presently) to allow for differences in age of the
examinees. But the basic difficulty of inequality of the age unit at
different points in the age scale still remains.

Of course, age norms are primarily appropriate for traits that depend
on general normal growth. A trait showing no continuous improvement
over an age range (such as acuity of vision) cannot possibly
be expressed in terms of a scale of age units. One that depends
primarily upon specific educational experiences, such as facility in
arithmetical operations, seems to be more reasonably related to the
educational framework of school grades than to the biological
framework of years of growth.

Finally, though it does not directly concern the consumer of tests,
it is worth noting that from the viewpoint of the test producer age
norms present some serious practical problems. It is often difficult
to get together a truly representative sample of individuals of a
given age. Thus, if one wanted a cross-section of 12-year-olds one would
have to look for some of them in the elementary school and some in
the junior high school. They would have to be assembled from quite
a range of school grades. Then as one moves toward the older ages
the sample one needs to reach is widely scattered: some in school,
some at college, some in the military establishment, and some in the
world of work. To reach a representative sample of [...], for
example, is a very forbidding task. This is one more reason why the
usual age norms for tests become suspect as one moves up into [...]

In summary, age norms, which are based on the performance of the
[...] meaning as a unit in terms of [...] Age norms are most
appropriate [...] elementary-school years and for abilities that grow
as a part of the general development of the individual. Physical and
[...] weight, and dentition, and [...] intelligence appear to be ones
for which this type of norm is most acceptable.

GRADE NORMS 

Grade norms have many of the [...] differing only in that the
reference [...] grade groups instead of age groups. That is, a test is
given [...] representative groups in each of a series of school
grades, and [...] determined for each grade. Scores lying [...]
successive grades are assigned fractional credits by [...] The
standard terminology assigns the value 5.0 to average [...] at the
beginning of the fifth grade, 5.5 to average performance [...] middle
of the grade, and so forth. A representative table of [...] norms for
the reading test of the Metropolitan Achievement [...] in Table 6.3,
p. 132. Thus, in this table a raw [...] corresponds to the performance
of the average child [...] of the third grade, a raw score of 15 is
average for beginning fourth grade, while 12 is average for the middle
of grade three.

Grade norms have [...] limitations as age [...] particular, we have no
[...] one grade is the same amount of growth at all grade levels.
[...] equality is [...] in the case [...] educational gains depend upon
the content and [...] instruction. [...] units to express growth [...]
sense for those subject areas in which instruction is continuous
[...] the school program. Since



Table 6.3. Grade Equivalents of Raw Scores for Reading
Metropolitan Achievement Tests: Intermediate Battery, Form [...]

Raw      Grade       Raw      Grade
Score    Equiv.      Score    Equiv.

44       12.5        22       5.3
43       12.2        21       5.1
42       11.8        20       4.9
41       11.6        19       4.7
40       11.2        18       4.5
39       10.8        17       4.4
38       10.3        16       4.2
37        9.7        15       4.0
36        9.2        14       3.8
35        8.7        13       3.7
34        8.4        12       3.5
33        8.0        11       3.3
32        7.7        10       3.1
31        7.3         9       3.0
30        7.1         8       2.8
29        6.8         7       2.6
28        6.6         6       2.5
27        6.3         5       2.3
26        6.1         4       2.0
25        5.9         3       1.8
24        5.7         2       1.6
23        5.5         1

Reproduced by permission of Harcourt, Brace and World, Inc.

instruction in most of the basic skill subjects tapers off during high
school, grade norms above the eighth or ninth have little direct
meaning. In most cases, these are extrapolated values similar to those
for the upper ages of age norms. Of course, grade norms for most
high-school subjects would be essentially meaningless, since these are
taught in only one or two grades.
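Using a table such as Table 6.3 is a plain lookup from raw score to grade equivalent. The Python sketch below copies only the rows for raw scores 9 through 15 from the table; the function names are illustrative. It assumes that the digit after the decimal point counts tenths of a school year, which the text implies (5.0 for the beginning of a grade, 5.5 for the middle) but does not spell out in full.

```python
# Rows copied from Table 6.3 (raw score -> grade equivalent).
GRADE_EQUIVALENTS = {9: 3.0, 10: 3.1, 11: 3.3, 12: 3.5, 13: 3.7, 14: 3.8, 15: 4.0}


def grade_equivalent(raw_score):
    """Look up the grade equivalent for a raw score in the excerpt above."""
    try:
        return GRADE_EQUIVALENTS[raw_score]
    except KeyError:
        raise ValueError(f"raw score {raw_score} is not in this excerpt") from None


def grade_and_tenths(grade_equiv):
    """Split a grade equivalent into (grade, tenths of the school year)."""
    grade = int(grade_equiv)
    tenths = round((grade_equiv - grade) * 10)
    return grade, tenths


for raw in (9, 12, 15):
    print(raw, grade_equivalent(raw), grade_and_tenths(grade_equivalent(raw)))
```

The lookup says only which grade group would find a given raw score average; it carries no information about how unusual that score is within the pupil's own grade.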

The slowing down of gains at the upper grade levels makes it very
difficult to express the performance of a very able child in terms of the
grade framework. Many a superior child in the seventh or eighth
grade can only be designated 11+ in terms of grade norms for standard
school subjects. That is, his performance surpasses that of the
average child in the highest grade for which norms are meaningful.

A further caution must be introduced with respect to the interpretation
of grade norms. Consider a bright and educationally advanced
child in the third grade. Suppose we find that on a standardized
arithmetic test he gets a score for which the grade equivalent is 5.9.
[...] end of the fifth grade, but this [...] score points (and
consequently a [...] earned merely [...] has a grade equivalent of
[...] remembering. The fact that [...] of a score and does not tell
in what way [...] on the administrative grouping [...] In the [...]
achievement, the concept of grade level is perhaps a more [...] than
age level. It is in relation to his grade [...] performance in these
areas is likely to be used and interpreted. Outside of the school
setting, grade norms have little meaning.

To summarize, grade norms, [...] performance of an individual to that
of the [...] grade level, are useful primarily in providing a [...]
interpreting the academic accomplishment of children in the elementary
school. For this purpose they are relatively convenient [...] though we
cannot place great confidence in the equality of grade units. They have
little [...] for other types of groups or measures.

PERCENTILE NORMS

We have just seen that in [...] age and grade norms we give
meaning to an individual’s score by [...] the age or grade group
[...] he would be just average. [...] a group of which [...] the
median, quartiles, and [...] falling below [...]

Table 6.4 shows percentile [...]


Percentile norms are very widely adaptable and applicable. They
can be used wherever an appropriate normative group can be obtained
to serve as a yardstick. They are appropriate for young or old, for
educational or industrial situations. To surpass 90 per cent of the
reference comparison group signifies a comparable degree of excellence
whether the function being measured is how rapidly one can
solve simultaneous equations or how far one can spit. Percentile
norms are widely used. Were it not for the two points that we must
now consider, they would provide a very nearly ideal framework for
interpreting test scores.

The first problem that faces us in the case of percentile norms is
that of the norming group. On what type of group should the norms
be based? Clearly, we will need different norm groups for different
ages and grades in our population. A 9-year-old must be evaluated
in terms of 9-year-old norms; a sixth grader, in terms of sixth-grade
norms; an applicant for a job as stock clerk, in terms of
stock-clerk-applicant norms. The appropriate norm group is in every
case the group to which the individual belongs and in terms of which
his status is to be evaluated. It makes no sense to compare a
medical-school applicant with norms based on unselected adults.

If we are to use percentile norms, then, we must have multiple sets
of norms. We must have norms appropriate for each distinct type of
group or situation with which our test is to be used. This is
recognized by the better test publishers, who provide norms not only
for different age or grade groups but also for special types of
educational or occupational populations. However, there are limits to
the number of distinct groups for which a test publisher can produce
norms.

Published percentile norms will often need to be supplemented by 
the test user, who can build up norm groups particularly suited to his 
individual needs. Thus, a given school system will often find it valu- 
able to develop local percentile norms for its own pupils. This will 
permit interpretation of individual scores in terms of the local group, 
a comparison that may be more significant for local problems than 
comparison with the national norms. Again, an employer who uses 
a test with a particular category of job applicants may well find it 
useful to prepare norms for this particular group of people. Evaluat- 
ing a new applicant will be much facilitated by these strictly local 
norms. 
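Building local percentile norms needs nothing more than the scores of the local group. The Python sketch below is one way to do it, under two loudly stated assumptions: the scores in `LOCAL_SCORES` are hypothetical, and the percentile rank is taken as the per cent of the local group scoring below a given score (other conventions, such as counting half of the tied cases, are also in use). The function names are illustrative.

```python
from bisect import bisect_left


def build_local_norms(local_scores):
    """Return a function giving the percentile rank of a score in the local group."""
    ordered = sorted(local_scores)
    n = len(ordered)

    def percentile_rank(score):
        # Number of local scores falling below `score`.
        below = bisect_left(ordered, score)
        return 100 * below / n

    return percentile_rank


# Hypothetical raw scores for a local group of 20 job applicants.
LOCAL_SCORES = [31, 34, 36, 38, 40, 41, 43, 44, 45, 46,
                47, 48, 50, 51, 53, 55, 57, 60, 63, 68]

local_rank = build_local_norms(LOCAL_SCORES)

for new_applicant_score in (47, 60):
    print(new_applicant_score, local_rank(new_applicant_score))
```

With so small a group the percentile ranks move in coarse steps; a real local norm table would rest on as many cases as the school or employer can accumulate.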

The second problem in relation to percentile norms is more serious.
Again we are faced by the problem of equality of units. Can we
think of 5 percentile points as representing the same amount throughout
the percentile scale? Is the difference between the 50th and 55th
percentile equivalent to the difference between the 90th and 95th?
To answer this, we must notice the way in which test scores for a group
of individuals usually pile up. We saw one histogram of scores in
Chapter 5 (p. 102). This picture is fairly representative of the way
the scores fall in many cases. There is a piling up of scores around
the middle scores and a tailing off at either end. The ideal model of
this type of score distribution, which is called the normal curve, was
also considered in Chapter 5 (pp. 113-115) and is shown in Fig. 6.2.
The exact normal curve is an idealized mathematical model, but many
types of tests and measures distribute themselves in a manner that
approximates a normal curve. You will notice the piling up of most
of the cases in the middle, the tailing off at both ends, and the
symmetrical pattern.

In Fig. 6.2, four score points have been marked. These are, in
order, the 50th, 55th, 90th, and 95th percentiles. Note that near the
median the 5 per cent of cases (the 5 per cent lying between the 50th
and 55th percentile) fall in a tall narrow pile. Toward the tail of
the distribution the 5 per cent of cases (the 5 per cent between the
90th and 95th percentile) make a relatively broad low bar. Five per
cent of the cases spread out over a considerably wider range of scores
in the second case than in the first. The same number of percentile
points corresponds to about three times as many score points when
we are around the 90th to 95th percentile as when we are near the
median. The further out on the tail we go, the more extreme the
situation becomes.
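The "about three times" figure can be checked against the exact normal curve. The short Python sketch below uses `NormalDist` from the standard `statistics` module (Python 3.8 or later) and measures each 5-percentile band in standard deviation units; the function name is illustrative, and the result holds only to the extent that the scores really follow a normal curve.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # normal curve with mean 0 and standard deviation 1


def width_in_sd_units(lower_percentile, upper_percentile):
    """Distance along the score scale between two percentiles, in SD units."""
    upper = unit_normal.inv_cdf(upper_percentile / 100)
    lower = unit_normal.inv_cdf(lower_percentile / 100)
    return upper - lower


near_median = width_in_sd_units(50, 55)
in_tail = width_in_sd_units(90, 95)
ratio = in_tail / near_median

print(round(near_median, 3), round(in_tail, 3), round(ratio, 2))
```

Trying a band further out, such as `width_in_sd_units(94, 99)`, gives a still wider stretch of the score scale for the same five percentile points.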

Thus, percentile units are typically and systematically unequal. The
difference between being first or second in a group of 100 is many
times as great as the difference between being 50th and 51st. Equal
percentile differences do not represent equal differences in amount.
Any interpretation of percentile ranks must take into account the fact
that such a scale has been pulled out at both ends and squeezed in the
middle. Mary, who falls at the 50th percentile in [...]

Fig. 6.2. Normal curve, showing selected percentile points.

If the percentile is to be meaningful, the group must [...] which it
is reasonable [...] usually need a number of [...] groups, if we are to
use a test with different ages, grades, or occupations. As long as
[...] type of norm is widely applicable [...] interpretation of
percentile values is made more difficult [...] we have a systematically
“rubber” scale whose units are small in the middle range and large at
the extremes.

STANDARD SCORES

Because the units [...] clearly not equal, [...] some other unit that
[...] the same meaning throughout [...] whole range of values.
Standard-score scales have been developed to [...] purpose.

In Chapter 5, we became [...] a measure of the spread or [...] of
scores. The standard deviation was a type of average [...] deviations
of scores away from the mean [...] Scores may be expressed in standard
[...] the mean. Thus, if the [...] standard deviation is 15, a score
[...] The means and standard [...] below, as are the scores [...] by
Johnny and Mary.

                       Test A    Test B

Mean                     65        40
Standard deviation       15        10
Johnny's score           77        55
Mary's score             87        48

Let us see how we [...] use standard scores to compare performances
[...] the two tests or [...] individuals.


On test A, Johnny is 12 points above the mean, or 12/15 = 0.8
standard deviations above the mean. On test B he is 15 points, or
15/10 = 1.5 standard deviations above the mean. Thus, Johnny does
a good deal better on test B than on test A. For Mary, the
corresponding calculations give

Test A: (87 - 65)/15 ≈ 1.5

Test B: (48 - 40)/10 = 0.8

Thus, we may say that Mary did as well on test A as Johnny did on
test B, and vice versa. Each pupil’s level of excellence is expressed
as so many standard deviation units above or below the mean of the
comparison group. This is a standard unit of measure having essentially
the same meaning from one test to another. For aid in interpreting
the degree of excellence represented by a standard score, see
Table 5.? (p. 115).
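The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `standard_deviation_score` is illustrative, and the raw scores, means, and standard deviations are the ones given in the text. Results are rounded to one decimal place, as in the text.

```python
def standard_deviation_score(raw, mean, sd):
    """Express a raw score as standard deviations above (+) or below (-) the mean."""
    return (raw - mean) / sd


# Mean and standard deviation of each test, as given in the text.
TESTS = {"A": {"mean": 65, "sd": 15}, "B": {"mean": 40, "sd": 10}}

# Raw scores earned by each pupil, as given in the text.
RAW_SCORES = {"Johnny": {"A": 77, "B": 55}, "Mary": {"A": 87, "B": 48}}

sd_scores = {
    pupil: {
        test: round(standard_deviation_score(raw, **TESTS[test]), 1)
        for test, raw in scores.items()
    }
    for pupil, scores in RAW_SCORES.items()
}

for pupil, scores in sd_scores.items():
    print(pupil, scores)
```

Mary's unrounded value on test A is 22/15, a little under 1.5, so her standing on test A and Johnny's on test B agree only after rounding.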

The type of score in standard deviation units that we have just
presented is satisfactory except for two matters of convenience: (1) it
requires us to use plus and minus signs which may be miscopied or
overlooked, and (2) it gets us involved with decimal points which
may be misplaced. We can get rid of the need to use decimal points
by multiplying every standard deviation score by some constant, such
as 10. We can get rid of minus signs by adding to every score a
convenient constant amount such as 50. Thus, for Johnny’s scores on
test A and test B, we have



                                              Test A    Test B

Mean of distribution of scores                  65        40
Standard deviation of distribution              15        10
Johnny's raw score                              77        55
Johnny's score in standard deviation units     +0.8      +1.5
Standard deviation score × 10                   +8       +15
Plus a constant amount (50)                     58        65


A table of standard scores for test A, based on this conversion, in
which the mean is set equal to 50 and the standard deviation to 10, is
shown in Table 6.5.

We could have used values other than 50 and 10 in setting up our
conversion into convenient standard scores. The Army has used a
standard-score scale with mean of 100 and standard deviation of 20
for reporting its test results. The College Entrance Examination
Board has long used a scale with mean of 500 and standard deviation
of 100. The Navy has used the 50 and 10 system.
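The conversion is the same linear rule whatever constants are chosen. The Python sketch below applies it to Johnny's scores; the function name and its default arguments are illustrative, while the test means, standard deviations, raw scores, and scale constants are those given in the text.

```python
def convert_to_scale(raw, test_mean, test_sd, scale_mean=50, scale_sd=10):
    """Convert a raw score to a standard-score scale by a linear conversion.

    The raw score is first expressed in standard deviation units, then
    multiplied by the scale's SD and added to the scale's mean.
    """
    sd_units = (raw - test_mean) / test_sd
    return scale_mean + scale_sd * sd_units


# Johnny's raw scores, with each test's mean and SD, on the 50-and-10 scale.
johnny_test_a = convert_to_scale(77, test_mean=65, test_sd=15)
johnny_test_b = convert_to_scale(55, test_mean=40, test_sd=10)

# The same test A performance on the other two scales named in the text.
johnny_a_army = convert_to_scale(77, 65, 15, scale_mean=100, scale_sd=20)
johnny_a_college_board = convert_to_scale(77, 65, 15, scale_mean=500, scale_sd=100)

print(round(johnny_test_a), round(johnny_test_b))
print(round(johnny_a_army), round(johnny_a_college_board))
```

Rounding to whole numbers is what removes the decimal point; the choice of scale mean and SD changes only the labels, not the pupil's standing.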

Table 6.5. Standard-Score Equivalents for Test A
(Standard score mean = 50, SD = 10)

Raw      Standard     Raw      Standard     Raw      Standard
Score    Score        Score    Score        Score    Score

120      87           80       60           40       33
115      83           75       57           35       30
110      80           70       53           30       27
105      77           65       50           25       23
100      73           60       47           20       20
 95      70           55       43           15       17
 90      67           50       40           10       13
 85      63           45       37            5       10

Originally used in the Air Force, stanine scores have had some
popularity in recent years. These are single-digit standard scores in
which the mean is 5 and the standard deviation 2. The relationships
among a number of the different standard score scales, and the
relationship of each to percentiles and to the normal curve are shown in
Fig. 6.3. The model of the normal curve is shown, and beneath it
are a scale of percentiles and several of the common standard score
scales. This figure illustrates the equivalence of scores in the
different systems. Thus, a stanine score of 7 corresponds to an Army
standard score of 120, a Navy standard score of 60, a College Board
standard score of 600, a percentile rank of 84. The particular choice
of score scale is arbitrary and a matter of convenience. It is too bad
that all testing agencies have not been able to agree upon a common
score unit. However, the important thing is that the same score scale
and comparable norming groups be used for all tests in a given
organization, so that results from different tests may be directly
comparable.
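The equivalences just quoted all describe one position on the normal curve, one standard deviation above the mean. The Python sketch below makes that explicit. It assumes an exactly normal distribution, uses `NormalDist` from the standard `statistics` module, and treats the stanine as a rounded value of 5 + 2z kept within 1 to 9; that rounding rule is an assumption for illustration, since the text says only that stanines are single-digit scores with mean 5 and standard deviation 2. All names are illustrative.

```python
from statistics import NormalDist

# (mean, standard deviation) of each scale, as described in the text.
SCALES = {
    "Army": (100, 20),
    "Navy": (50, 10),
    "College Board": (500, 100),
}


def stanine(sd_units):
    """Single-digit standard score: mean 5, SD 2, kept within 1 to 9."""
    return max(1, min(9, round(5 + 2 * sd_units)))


def equivalents(sd_units):
    """Express one position on the normal curve in each scoring system."""
    result = {name: mean + sd * sd_units for name, (mean, sd) in SCALES.items()}
    result["stanine"] = stanine(sd_units)
    # Per cent of a normal distribution falling below this point.
    result["percentile"] = round(100 * NormalDist().cdf(sd_units))
    return result


one_sd_above = equivalents(1.0)
print(one_sd_above)
```

Calling `equivalents(0.0)` gives the corresponding values for a score exactly at the mean.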

Frequently standard-score scales are developed via the percentiles
corresponding to the raw scores. The test maker assumes that the trait
he is measuring is basically distributed in accordance with the normal
curve. If he does not get a normal distribution of scores in his
norming group, he assumes that this is because the raw-score units in
which his test scores were expressed did not represent equal units
throughout the range of scores. You will remember our discussion of
this point in connection with our spelling test (pp. 125-126). He
therefore takes steps to make his distribution of standard scores
normal; that is, he normalizes it. The actual calculations make use of
percentiles and of tables of the normal curve. We shall not illustrate
the details of procedure here.
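For the curious reader, the general idea can be sketched in a few lines of Python. This is not the procedure of any particular test publisher. It assumes one common convention, in which each raw score's percentile rank is taken at the mid-point of the cases earning that score, and it uses `NormalDist` from the standard `statistics` module in place of printed tables of the normal curve. The function name and the choice of a 50-and-10 scale are illustrative.

```python
from statistics import NormalDist


def normalized_standard_scores(raw_scores, scale_mean=50, scale_sd=10):
    """Map each distinct raw score to a normalized standard score.

    Steps: (1) find the proportion of the norming group below each raw
    score, counting half of the cases at that score; (2) find the point
    on the normal curve below which that proportion falls; (3) express
    that point on the chosen standard-score scale.
    """
    n = len(raw_scores)
    unit_normal = NormalDist()
    table = {}
    for raw in sorted(set(raw_scores)):
        below = sum(1 for score in raw_scores if score < raw)
        at = sum(1 for score in raw_scores if score == raw)
        proportion = (below + 0.5 * at) / n  # always strictly between 0 and 1
        sd_units = unit_normal.inv_cdf(proportion)
        table[raw] = round(scale_mean + scale_sd * sd_units)
    return table


# Hypothetical, badly skewed raw scores from a small norming group.
print(normalized_standard_scores([1, 2, 3, 4, 100]))
```

Notice that the extreme raw score of 100 is pulled in toward the rest of the group: the conversion depends only on rank order in the norming group, not on the size of the raw-score gaps.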




These standard scores have the distinctive feature that they are
guaranteed to have a normal distribution, at least for a population
comparable to that on which the original norms were obtained. The
score scale has been stretched in some places and squeezed together
in others so that finally a normal distribution results. This process
of stretching and squeezing can take care of any inequality in the
original units at different raw-score levels in the test. If the basic
assumption of a normal distribution was justified, this transformation
will produce a scale in [...] same amount at any point on the [...]
literature [...] standard score [...] of the individual’s [...]
percentiles in that they are [...] particular reference group. [...]
standard [...] been used by different testing agencies.

INTERCHANGEABILITY OF DIFFERENT TYPES OF NORMS

[...] by the test publisher. [...] on the test, together with the
corresponding score equivalents in the system of norms being used.
[...] provide tables giving more than one type of [...] given in
Table 6.6. Here we see the norms for the Language Study Skills Test of
the revised Metropolitan [...] Four types of norms are shown. The
percentiles [...] based on a group tested early in the sixth grade.
The standard [...] assigns a mean of 50 [...] sixth-grade group.
Thus, a [...]

3. Receiving a standard score [...]

4. Receiving a stanine of [...]

[...] Table 6.6 it is easy to see that the different systems of norms
[...] ways of expressing the same thing. [...] translate [...]
different purposes.



142 


NORMS AND UNITS FOR MEASUREMENT 


Table 6.6. Norms for Language Study Skills Test of Metropolitan
Achievement Tests, Intermediate Battery, Form A

                                 Grade 6             Grade 6
 Raw    Standard   Grade     Percentile Rank         Stanine
Score    Score     Equiv.   (October testing)   (October testing)

 28       80       12.5
 27       77       12.0           99+
 26       73       11.6           99                    9
 25       70       11.1           98
 24       67       10.6           95
 23       65       10.1           93                    8
 22       62        9.4           88
 21       60        8.6           85                    7
 20       58        8.0           80
 19       56        7.4           70
 18       54        7.0           65                    6
 17       52        6.6           60
 16       50        6.2           50
 15       49        5.9           45                    5
 14       47        5.6           40
 13       45        5.3           30
 12       43        5.0           25                    4
 11       42        4.8           20
 10       40        4.5           17                    3
  9       38        4.3           13
  8       36        4.0           10
  7       34        3.8            8                    2
  6       32        3.6            5
  5       30        3.3            2
  4       28        3.1            1
  3       24        2.9            1                    1
  2       20        2.6            1-
  1       16        2.4

Reproduced by permission of Harcourt, Brace and World, Inc.
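The conversion that a norms table supports can be sketched in a few lines of Python. The sketch below simply copies the rows of Table 6.6 into a dictionary and looks up a raw score; the names `TABLE_6_6` and `convert` are our own illustrative choices, and percentile cells that the table leaves blank or prints as 99+ or 1- are stored as `None`. Stanines are omitted because the table prints them only once per band.

```python
from typing import Dict, Optional, Tuple

# Rows of Table 6.6: raw score -> (standard score, grade equivalent, percentile rank).
# A percentile of None marks a cell that is blank or printed as "99+" / "1-".
TABLE_6_6: Dict[int, Tuple[int, float, Optional[int]]] = {
    28: (80, 12.5, None), 27: (77, 12.0, None), 26: (73, 11.6, 99),
    25: (70, 11.1, 98), 24: (67, 10.6, 95), 23: (65, 10.1, 93),
    22: (62, 9.4, 88), 21: (60, 8.6, 85), 20: (58, 8.0, 80),
    19: (56, 7.4, 70), 18: (54, 7.0, 65), 17: (52, 6.6, 60),
    16: (50, 6.2, 50), 15: (49, 5.9, 45), 14: (47, 5.6, 40),
    13: (45, 5.3, 30), 12: (43, 5.0, 25), 11: (42, 4.8, 20),
    10: (40, 4.5, 17), 9: (38, 4.3, 13), 8: (36, 4.0, 10),
    7: (34, 3.8, 8), 6: (32, 3.6, 5), 5: (30, 3.3, 2),
    4: (28, 3.1, 1), 3: (24, 2.9, 1), 2: (20, 2.6, None),
    1: (16, 2.4, None),
}


def convert(raw_score: int) -> dict:
    """Return the converted scores that Table 6.6 lists for one raw score."""
    if raw_score not in TABLE_6_6:
        raise KeyError(f"raw score {raw_score} is not listed in Table 6.6")
    standard, grade_equiv, percentile = TABLE_6_6[raw_score]
    return {
        "standard_score": standard,
        "grade_equivalent": grade_equiv,
        "percentile": percentile,
    }


if __name__ == "__main__":
    print(convert(20))
```

A raw score of 20, for example, comes back with a standard score of 58, a grade equivalent of 8.0, and a percentile rank of 80, which are three ways of describing the same performance.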


However, the different norm systems are not entirely consistent as
we shift from one type of test to another. This is due to the fact that
some functions mature more rapidly from one year to the next, rela-
tive to the spread of scores at a given age or grade level.

This can be seen most dramatically by comparing reading compre- 




Paragraph Mirmins ArilhmeTic CompaohoR 

„ 11 ‘»1 29 2S 

Raw score ,, 7 4 5.2 

Grade equivalent •*• 5^ 92 90 

Grade 5.2 percentile 50 — 

Achievenrent Tea. BaUepy B — « 
tasted at the end of 2 months in ">= «'■'' „.ai 5.2 

on both tests that were just a'Mg^ “n,® tested after 2 months in 
and he was at the 50th f ^nnance. bet hosv does he 

the fifth grade. Henry shows s,V« F ^ 

compare in the two subjects. . his grade plaeement. But 

well in both; be is just one full “ j„ arithmetic than in readmE. 
in terms of percentiles he ts rouA tener^^^^ „„ ,hc 

l.e., 92nd percentile as compmc * ^„,h „3ding and arilh- 

other hand tails at iu« the Is ,.4 and for 

raetic. In his case, his grad, eq 

arithmetic is 6.1. ^hovc esampic arc due to i • 

The discrepancies that „ and rate of growth of reading 

fercnces in ''’’"“*“''‘r!h‘::^rwS^ spread within a single gmde 

and arithmetic. to erode. Some fifth graders 

group, relative to the ehance grader, so a grade equtva- 

read better than the average ^ graders. In fact a gra 

lont of 8 or 9 is not ^^ntUe for pupils a. gra J 

equivalent of 8.0 '“"fP'’"* “J|„„,t never docs as well .n an hmeltc 
5 2 By contrast, a fifth gra because he his not cnco 

an efghth or ninth g-f 'V" be presented in the ^ 

or been taught many ^ ^ jbus. fifth graders arc more ‘ 
MTih seventh, and eighth grades, tnu • 

;:„=;us with fifth to eighth grade than doe. 

.rtihmetie shows more raptd 

dllfercnce may result, m 



in the growth functions for the subjects and need not mean a genuinely
uneven pattern of progress for the child.


QUOTIENTS

In the early days of mental testing, after age norms had been used
for a few years, the need was felt to convert the age score into an index
that would express rate of progress. The 8-year-old who had an age
equivalent of 10½ years was obviously better than average, but how
much better? Some index was needed to take account of chronological
age (actual time lived) as well as the age equivalent on the test (score
level reached).

The expedient was hit upon of dividing test age by chronological
age to yield a quotient. This procedure was applied most extensively
with tests of intelligence where the age equivalent we were concerned
with was a mental age and the corresponding quotient was an intelli-
gence quotient. However, it was also used to some extent for achieve-
ment tests and for some other sorts of measures.

The formula for computing the intelligence quotient in this way is
given below and is illustrated for the 8-year-old who reaches the 10½-
year level on the test.


$$ IQ = \frac{100\,MA}{CA} = \frac{100(10.5)}{8} = 131 $$


A similar quotient could be computed for a reading test, general
achievement battery, measure of strength, or any other testing instru-
ment that yields age norms. The resulting value would be called a
reading quotient (RQ), educational quotient (EQ), or the like.
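The arithmetic of a ratio quotient is easily sketched. The function below is a minimal illustration, assuming only that a test age and a chronological age are expressed in the same unit (years here); the name `ratio_quotient` is ours, and rounding to a whole number follows the worked example above rather than any stated rule.

```python
def ratio_quotient(test_age: float, chronological_age: float) -> int:
    """Return 100 * test age / chronological age, rounded to a whole number."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return round(100 * test_age / chronological_age)


if __name__ == "__main__":
    # The 8-year-old who reaches the 10.5-year level on the test.
    print(ratio_quotient(10.5, 8))
```

The same function serves for a reading quotient or an educational quotient, since only the meaning of the test age changes.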

How does an intelligence quotient come to have meaning? In the 
first place, it is obvious by the way in which the quotient was estab- 
lished that 100 should be average at every age group, since the average 
10-year-old, for example, should fall exactly at the 10-year level on 
any test if the age equivalents were properly established. But how out- 
standingly good is 125? How poor is 80? Such questions as these 
can only be answered by becoming acquainted with the distribution 
of quotients that a particular test yields.

The intelligence quotient was originally developed in connection
with the individual intelligence test of the type represented by the
Stanford-Binet (see Chapter 9). A typical distribution of intelligence
quotients for the 1937 revision of that test, based upon the standardi-
zation group, is shown in Table 6.7. This table shows the per cent of

Table 6.7. Distribution of Revised Stanford-Binet IQ's

                   Per Cent     Cumulative
IQ Range           of Cases     Per Cent

140 and over          1.3          99.9
130-139               3.1          98.6
120-129               8.2          95.5
110-119              18.1          87.3
100-109              23.5          69.2
 90- 99              23.0          45.7
 80- 89              14.5          22.7
 70- 79               5.6           8.2
 60- 69               2.0           2.6
Below 60              0.6           0.6

From L. M. Terman and

scale, Boston, Houghton Mifflin
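The cumulative column of Table 6.7 is simply a running total of the per cent column, taken from the lowest interval upward. A short sketch, assuming the figures as printed in the table, makes the relationship explicit; the names `RANGES`, `TENTHS`, and `cumulative_per_cent` are illustrative only. Percentages are stored in tenths of a per cent so that the sums are exact.

```python
from itertools import accumulate

# IQ ranges of Table 6.7, lowest first, with per cent of cases in tenths of a per cent.
RANGES = ["Below 60", "60-69", "70-79", "80-89", "90-99",
          "100-109", "110-119", "120-129", "130-139", "140 and over"]
TENTHS = [6, 20, 56, 145, 230, 235, 181, 82, 31, 13]


def cumulative_per_cent() -> list:
    """Return the cumulative per cent through each range, lowest range first."""
    return [total / 10 for total in accumulate(TENTHS)]


if __name__ == "__main__":
    for label, cumulative in zip(RANGES, cumulative_per_cent()):
        print(f"{label:>12}  {cumulative:5.1f}")
```

Each cumulative value tells what per cent of the group falls in or below that interval, so the value for the interval just below a given IQ shows roughly what per cent of the group that IQ surpasses.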

• . \f\ interval and ihe cumulative per- 
cases falling svithln each 1^°“ ° | 3 pj, cent of cases got IQs 
centage through each ,30 to 139. and so forth. An 10 

of 140 and over. 3.1 per «" of the group (fall at the 

of 125 would surpass j gj ,vouId surpass only about 8 per 

L'rr^::h-"Itdlstrlhu,loaof.Q'sisl01.5,aod*^ 

tain the saase ovenso a"d interpretation was appiopnate 

deviation of approx 


dren. This relationship of quotients to standard scores is explicitly
recognized in most recent intelligence tests. For these, tables of IQ
equivalents have been set up at each age level. These have been built
so as to give a common mean and standard deviation for all age groups.

As a matter of fact, the most recent edition of the Stanford-Binet,
brought out in 1960, also uses standard scores designed so that the
mean is 100 and the standard deviation 16 at each age level, rather
than the MA/CA ratio that was the basis for the IQ in earlier editions.
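The idea of an IQ defined as a standard score can be sketched as follows. This is a simplified illustration, not the publisher's procedure: it assumes that the mean and standard deviation of raw scores for the child's age group are known, and the function name and the sample figures are hypothetical. Only the target mean of 100 and standard deviation of 16 come from the text.

```python
def deviation_iq(raw_score: float, age_group_mean: float, age_group_sd: float,
                 mean: float = 100.0, sd: float = 16.0) -> float:
    """Convert a raw score to a standard-score IQ with the given mean and SD."""
    if age_group_sd <= 0:
        raise ValueError("standard deviation must be positive")
    z = (raw_score - age_group_mean) / age_group_sd  # distance from the group mean in SD units
    return mean + sd * z


if __name__ == "__main__":
    # Hypothetical age group: mean raw score 50, standard deviation 8.
    print(deviation_iq(60, 50, 8))
```

Because the conversion is made separately within each age group, a given IQ has the same meaning, in standard deviation units, at every age.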

The quotients yielded by different tests are, unfortunately, not ex- 
actly equivalent. A variety of factors in the test and in the selection 
of norming groups have led to somewhat different means and standard 
deviations of intelligence quotients. Some evidence on the variability 
of quotients for five widely used tests for high-school groups is pre- 
sented in Table 6.8. Experience with a test in a particular community 


Table 6.8. Equivalent IQ's on Five Widely Used Group Intelligence
Tests (From Engelhart)

                 California
                 Mental Maturity,                   Lorge-         Pintner
Otis Quick-      Short Form,       Kuhlmann-        Thorndike      General Ability,
Scoring Beta,    Intermediate,     Anderson         Verbal,        Intermediate,
Form Fm          Form S            Battery,         Level 4,       Form A
                                   Booklet G        Form A

    140              145              140              142             151
    130              134              130              132             139
    120              123              121              121             126
    110              113              111              111             113
    100              102              101              100             100
     90               92               92               90              87
     80               81               82               79              74
     70               70               73               69              61
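A table of equivalents such as Table 6.8 can be used to translate a quotient on one test into the roughly corresponding quotient on another. The sketch below copies the rows of the table and, for values that fall between the printed rows, interpolates linearly. The interpolation is our own assumption for illustration and is not part of the table; the short test names are likewise hypothetical labels.

```python
from bisect import bisect_right

# Column order follows Table 6.8.
TESTS = ("otis", "california", "kuhlmann_anderson", "lorge_thorndike", "pintner")

# Rows of Table 6.8, lowest first.
ROWS = [
    (70, 70, 73, 69, 61),
    (80, 81, 82, 79, 74),
    (90, 92, 92, 90, 87),
    (100, 102, 101, 100, 100),
    (110, 113, 111, 111, 113),
    (120, 123, 121, 121, 126),
    (130, 134, 130, 132, 139),
    (140, 145, 140, 142, 151),
]


def equivalent(iq: float, source: str, target: str) -> float:
    """Translate an IQ on the source test to the target test by linear interpolation."""
    s, t = TESTS.index(source), TESTS.index(target)
    xs = [row[s] for row in ROWS]
    ys = [row[t] for row in ROWS]
    if not xs[0] <= iq <= xs[-1]:
        raise ValueError("IQ lies outside the range covered by the table")
    i = min(bisect_right(xs, iq), len(xs) - 1)  # index of the upper bracketing row
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (iq - x0) / (x1 - x0)


if __name__ == "__main__":
    print(equivalent(135, "otis", "pintner"))
```

The widening gap between the first and last columns at the extremes reflects the different standard deviations of quotients on the two tests.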


setting will provide a further basis for interpreting quotients at dif- 
ferent levels. 

The notion of the intelligence quotient or IQ is deeply imbedded in 
the history of the testing movement, and, in fact, in twentieth-century 
American culture. The expression “IQ test" is a part of our common 
speech. We are probably stuck with the term. But in the future 
IQ's will in most cases really be standard scores. And this is how 
we should think of them and use them. We may hope that eventually
the test publishers will agree upon a common standard score scale and
will establish more clearly comparable normative groups, so that scores
on different tests will be more directly comparable.

PROFILES

The various types of norms we have been considering provide a
means of expressing scores on quite different tests in common units in
such a way that they can be directly compared. There is no direct way
of comparing a score of thirty words correctly spelled with one of
twenty arithmetic problems solved. But if both scores are expressed
in terms of the grade level to which they correspond or in terms of
the per cent of some defined common group that gets scores below
that point, then they may be compared. The set of different test scores
for an individual, expressed in a common unit of measure, constitutes
his score profile. The separate scores may be presented for compari-
son in tabular form by listing the converted score values. Illustrations
of record forms showing the manner of recording converted scores are
given in Figs. 6.4 and 6.5. The comparison of different subareas of
performance is made pictorially clearer by a graphic profile. Several
ways of plotting profiles are shown in Figs. 6.6, 6.7, and 6.8.

Figure 6.6 shows the form for plotting the subscores of the Califor-
nia Test of Mental Maturity. Each subtest is represented by a row.
The scale of age equivalents appears across the top of the form. The
broken vertical line portrays the performance of the particular indi-
vidual. Peaks in performance are to the right and low points to the left.

Figure 6.7 shows a similar form for plotting part scores on the
Metropolitan Achievement Test. This form differs in representing the
different tests in successive columns and presenting the score scale in
the vertical dimension. Grade equivalents are shown on the vertical
scale.

Figure 6.8 shows a type of profile chart for the component tests of
the Differential Aptitude Test Battery. This battery undertakes to ap-
praise different aspects of ability important in a high-school guidance
program. Note that in this case the different tests are represented by
separate bars, rather than points connected by a line. The scale used
in this case is a percentile scale, but in plotting percentile values ap-
propriate adjustments have been made for the inequality of percentile
units. That is, percentile points have been spaced in the same way as
they are in a normal curve, being more widely spaced at the upper and
lower extremes than in the middle range. This percentile scale cor-
responds to the percentile scale that is shown in Figure 6.3 (p. 140).
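The spacing of percentile points on such a chart can be illustrated with a brief sketch. Assuming a normal distribution, each percentile is located at the corresponding point on the baseline of the normal curve; the function name `percentile_to_z` is ours, and the calculation uses the standard normal distribution supplied by Python's `statistics` module.

```python
from statistics import NormalDist


def percentile_to_z(percentile: float) -> float:
    """Return the normal-curve position (in SD units) of a percentile between 0 and 100."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must lie strictly between 0 and 100")
    return NormalDist().inv_cdf(percentile / 100)


if __name__ == "__main__":
    for p in (1, 10, 25, 50, 75, 90, 99):
        print(f"{p:>3}  {percentile_to_z(p):+.2f}")
```

Printing the positions shows the pattern described above: the step from the 90th to the 99th percentile covers far more of the baseline than the step from the 50th to the 60th.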

By this process, the percentile values for an individual are plotted on






Fig. 6.4. Class record form for Metropolitan Achievement Test. (Scores recorded as grade equivalents.)


California Test of Mental Maturity • Qass Record Sheet 









•mJWUuol pupil. (RuproJu'"! by p..ml..li.n pI Colilo. 





an nqual-unll scale. A g.ven whether it lies hig^ 'O''. 

of as rcprescnling the satne =■"<=“"> “' “‘^7, ftc same distance 
or near the middle of '"'ff. another, 

can be considered equivalent fro j ^ down from 

Note that in Fig. 6.8 'h' ,he average of the group 
the 50th percentile. Fft.'^rlhe scale and individual scores can be 
constitutes the anchor point ,^'’„rr.gure brings out the indi- 

referred to this base ie ■ dramatically. . , 

vidual’s strengths and weataessesveo d^.^^ m 
The profile chart makes a roBIes, however, sereial 

scores lor an individual. In > tP - procedures 

cautions must be borne rn ■"‘"‘'•J;;,™ the several tests are com- 
plotUng profiles assume t''t“ scores must be based upon equ. 
parablm Age, grade, or arantee of '7' 

Lt groups for all the dl tests. This is fq 

course, a common P°P"'7rihe diSmnt subtests of a '«• “S 
that commonly prevails ^ j^rrie time on the bas - 

Norms for all are ^Xmparability »' ,■>“= 

a common group. The gim j„r,s, attractive 

the different component tests is one plotted ro- 

an integrated battery. « ttept”'"!' ■ 




Fig. 6.8. Pupil profile chart for Differential Aptitude Tests. (Reproduced by permission
of the Psychological Corporation.)


geiher, we can usually only hope that the groups on which norms were 
established were comparable and that the profile is an unbiased picture 
of relative achievement in the different fields. Where it is necessary
to use tests from several different sources, one solution is to develop
our own local norms on a common population and to plot individual
profiles in terms of the local norms.

A second problem is that of deciding how to interpret the ups and
downs of a profile. Not all the differences that appear in a profile are
meaningful, either in a statistical or a practical sense. We must decide
which of the differences deserve some attention on our part and which
should be ignored. This problem arises because no test score is
completely exact. A full discussion of the problem of reliability and
of the “error of measurement” in a test score will be provided in the
following chapter. At this point, we shall merely note that test scores
are not perfectly accurate, that performance on a reading test or an







aptitude test will vary somewhat from form to form and from occasion
to occasion. Thus, small differences from one score to another in a
test profile should be largely ignored as having very probably arisen
by chance. Only as the differences between scores become substantial
in relation to the standard error (see p. 175) of the separate scores is
there any justification for interpreting the differences as representing
something real and significant.
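One way of judging whether a difference is substantial in relation to the standard errors can be sketched as follows. The sketch rests on two assumptions of ours that the text does not state: that the errors in the two scores are independent, so that the standard error of the difference is the square root of the sum of the squared standard errors, and that a difference should exceed about twice that value before it is taken seriously. The function name and the sample figures are hypothetical.

```python
from math import hypot


def difference_is_substantial(score_a: float, score_b: float,
                              se_a: float, se_b: float,
                              multiple: float = 2.0) -> bool:
    """Return True if two scores differ by more than `multiple` standard errors of the difference."""
    se_difference = hypot(se_a, se_b)  # sqrt(se_a**2 + se_b**2), assuming independent errors
    return abs(score_a - score_b) > multiple * se_difference


if __name__ == "__main__":
    # Hypothetical grade equivalents, each with a standard error of 0.4 grade units.
    print(difference_is_substantial(6.4, 6.1, 0.4, 0.4))
    print(difference_is_substantial(8.0, 6.2, 0.4, 0.4))
```

With these illustrative figures, a gap of three tenths of a grade would be set aside as probably due to chance, whereas a gap of nearly two grades would merit attention.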

Organizing the separate test scores of an individual into a graphic
profile is, then, a very effective way of dramatizing the high and low
points in a score pattern. Such a profile may be plotted whenever
scores from several different tests are expressed in the same units.
However, a profile must be interpreted with a good deal of caution,
because even unreliable differences may look quite impressive.

USING NORMS 

We have seen that norms provide a basis for interpreting the score
of an individual. Converting the score for any test into an
age or grade equivalent, percentile, or standard score permits an inter-
pretation of the level at which the individual is functioning on that
particular test. Bringing together the set of scores for an individual
in a common unit of measure, and perhaps expressing them in a
profile, brings out the relative level of performance of the individual
in different areas.

The median performance for a class, a school, or
the children in a grade throughout a school system may be similarly
reported. We then see the average level of performance within the
group on some single function or the relative performance of the group
in each of several areas. Norms provide a frame within which the
picture may be viewed and bring all parts of the picture into the com-
mon frame. Now what does the picture mean, and what should we
do about it?

Obviously we cannot, in a few pages, provide a ready-made inter-
pretation for each set of scores that may be obtained in a practical
testing situation. However, we can lay out a few general guiding
rules and principles that may help to forestall unwise interpretations
of test results. The first points apply primarily to the inter-
pretation of group results, the later points relating
primarily to interpretation of individual scores. However, the two
overlap somewhat, and each has some relevance to the other type of
situation.




PRINCIPLES GUIDING INTERPRETATION OF GROUP PERFORMANCE

1. In Evaluating Average Group Achievement, Consideration Must
Be Given To Average Ability Level in the Group. A class
with an average mental age of 10 years could not be expected to do
arithmetic as well as one with an average mental age of 12 years.
Some adjustment must be made for the typical ability level. However,
one must be somewhat conservative in making this adjustment, espe-
cially for classes superior on an intelligence test. The relationship
between intelligence and academic achievement is not perfect, and a
group of bright youngsters will rarely be comparably superior in
achievement. This will be true particularly in the
less academic subjects, such as spelling or handwriting. A class
that deviates from average in ability can be expected to differ from the
general norm in achievement also, and in the same direction, but it
should not be expected to differ as much in achievement as it does in
ability.

2. A Further Factor That May Be Expected To Influence Achieve-
ment Is the Type of Cultural Background from Which the Pupils
Come. Home and community influences are strong. Foreign-lan-
guage background, absence of pictures and books in the home, a nega-
tive family attitude toward schools and schooling may all be important.
In a measure, these factors affect intelligence test score. But they
affect achievement also, and perhaps more directly. Where a class
is atypical in cultural background, either especially favored or espe-
cially deprived, allowance must be made for this in interpreting test
results.

3. Group Achievement Can Only Be Evaluated in the Light of Cur-
ricular Content, Emphases, and Objectives. If a school system has de-
layed all formal instruction in arithmetic until the third grade in order
to provide more time in the earlier grades for group projects, social
experiences, and preparatory materials, it is unreasonable to expect
the children in the third grades of that system to come up to national
third-grade norms in arithmetic. If a school system has de-emphasized
accurate spelling as an objective, has cut down or eliminated spelling
drills, and has concentrated on other educational outcomes, it is inap-
propriate to evaluate that school by rigid application of national norms
in a standardized spelling test. There is a good deal of evidence from
test results themselves that schools in the more prosperous and priv-
ileged communities have de-emphasized the basic tool skills of arith-
metic and spelling in the early grades. In these grades such communi-
ties often do no better in computation and spelling than much poorer
communities with children of lower intelligence.




or course fte 

spelling in order to achieve other yjCihi:! they are can only be 
may not actually be achieving ■ obiectivcs as ability 

ansLred as svc develop nreasures to appm.se such ob,em^^_^^ 

to follow directions, to work , „, 3 [io„ 5 t,ips, which arc 

along with other children, tir to gtos of these communi- 

objectives given emphasis in the sm ed obf^tes of .W 
ties. Instruments for “PP™''’”® ,ho schools themselves, 

attention of the measur^ent sp curricular empha- 

But one thing is cleat. The school 1 jj^ojjtdijcd test results, 
ses must be taken into account in in ^ p„„,n,f. One 

4. Use ol Test Remits ^ „„ achievement tests 

continually encounters *"““1““ ^ professional worth of teachers, 

ate used as a basis for ®/,ha teacher's head, a recur- 

The test then becomes a sword he oaajer 

ring threat to his security. '“t^aSTes in order to "beat the tesf 
it the test is resented, if the The teacher is now 

or even gives illicit help at the time of «tmg 
on the side of the pupils workring agains th 

This type of situation is to be "„a| and will disappear i 

In large measure out of 'I? a jirtools to help both pupil and 

administrators see the tests f „n in the fall, when they 

teacher. This will be facilitated if te ts ar 6^ .^an in die 

PRINCIPLES GUIDING INTERPRETATION OF INDIVIDUAL PERFORMANCE

level is not a reeding „( Wm. Too many remcihal 

", 

to be 05 superior mac 'C j^pends upon exposure- . j 

S“"T;eehndwho^l.^P>ck^X^^^^ 

to be somewhat to 




F™«ru"rf'cu!“ ''■'^‘= i™on.iv°" io 

Lnal skills and accomplishments, and allowance for Q 
will only in part take account of these factors.

3. The Individual Child's Performance, Too, Must Be Interpreted in
Terms of the Curriculum To Which He Has Been Exposed. The in-
dividual pupils cannot be expected to progress as rapidly in
areas in which teaching emphasis is less. Furthermore, in the areas
that are closely dependent upon instruction, even the able pupil can-
not be expected to move ahead at a tempo much faster than that at
which the material is presented. Thus, the bright child may be ex-
pected to be more advanced in word knowledge and reading skill,
which he can readily pick up on his own, than in the processes of arith-
metic, which he is unlikely to master until he has been exposed to them
in the school setting.

4. In the Case of the Single Individual, We Must Be Acutely Aware
of the Existence of Errors of Measurement. A test score does not
identify the exact level of ability for the child. It represents the most
likely value within a fairly broad band of possible values. Differences
between areas of achievement must be viewed as tentative as long as
these bands overlap. Differences between standing on two testings
(say, two reading tests a few months apart) should not excite us unduly
unless they are quite substantial. We should be rather conservative
in “explaining” differences that may represent nothing more than the
fallibility of our measuring instrument.
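The notion of overlapping bands can be made concrete with a small sketch. We assume, for illustration only, that each band extends one standard error on either side of the obtained score; the width of the band, the function names, and the sample figures are our own choices and not values given in the text.

```python
from typing import Tuple

Band = Tuple[float, float]


def score_band(score: float, standard_error: float, width: float = 1.0) -> Band:
    """Return the (low, high) band of plausible values around an obtained score."""
    margin = width * standard_error
    return (score - margin, score + margin)


def bands_overlap(band_a: Band, band_b: Band) -> bool:
    """Return True if the two bands share at least one value."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


if __name__ == "__main__":
    # Hypothetical grade equivalents with a standard error of 0.4 grade units.
    reading = score_band(6.4, 0.4)
    arithmetic = score_band(6.1, 0.4)
    print(bands_overlap(reading, arithmetic))
```

When the bands overlap, as they do for the two hypothetical scores shown, the difference between the areas is best treated as tentative.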

5. In the Very Nature of Things, by the Way Test Norms Are De-
veloped, We Must in General Expect Half of a Group to Fall Below
the Norm. The norm is the average, the typical. It is neither the
ideal of satisfactory accomplishment nor the standard to which we
can hold everybody. It is the typical performance of typical indi-
viduals at the present time. In any average there must be as many
below as above. Educators must avoid the compulsion to bring every-
body “up to the norm.” We must be careful not to try to fit everybody
into the Procrustean bed of the average.
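That about half of a group must fall below the norm follows directly from the way the norm is defined. The sketch below takes the norm to be the median of the group, which is an assumption on our part, and counts the share of scores that fall below it; the scores listed are invented for illustration.

```python
from statistics import median


def share_below_norm(scores: list) -> float:
    """Return the fraction of scores that fall below the group median."""
    if not scores:
        raise ValueError("at least one score is required")
    norm = median(scores)
    below = sum(1 for score in scores if score < norm)
    return below / len(scores)


if __name__ == "__main__":
    # Hypothetical grade equivalents for eight pupils.
    print(share_below_norm([3.8, 4.5, 5.0, 5.3, 5.9, 6.2, 6.6, 7.4]))
```

However much the whole group improves, the share below its own median stays near one half, which is why bringing everybody up to the norm is not an attainable goal.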




SUMMARY STATEMENT 


A raw score, taken by itself, has no meaning. It gets meaning only 
by comparison with some reference group or groups. The comparison 
may be with: 




1. A series of age groups (age norms).

‘"r Aiiirf™'?." ~ 

the group mean (standard scores) 

nael, al.eroa.ise ha, eer.am ajvan.apcs and cerrain hmualions, ssh,eh 

we have considered mmtients such as the 

To ge, an index of brightness These 

inlellipenee quotient and '*“l, 1 , 31 c approsimatcly the 

become tneaningful and usable » ,^ 5 y are 

mme stand.ard desiation tor f .'nLeh^t =s sueh. 
essentially standard scores and shou ■■ ^ arc of the same 

„ the norms asetilable for a ‘f .„c can be ev 

iiind and are based on oomparable pr P^^^ ^ p.clnrlally •" 

pressed in comparable terms- T y j, Sciences ssithin the 

the torm of a profile, rrofiles emp over- 

individual. When profiles are used cate mu 
Inlcrpret minor ups and doAns ® ^ ( 0 , imetpretinp the score 

Norms represen. a ‘''''’'T"''' apprepaiion. Hoivevcr, 
of an Individual, a class proup. Jn individual or group 

before a judgment can be made a level, cul- 

ls doing ssell or poorly, f''"”"" ” ^ 3 , 0 , The norm is merely 

lund baelpround, and ^“tr'e“>a- «ph « ,0 m. 

average, not a strait jachet into »hich all ea 

REFEBENCES 

If • //I Stud- C/uW 

-T^ nlnv.c.l growth of girls. ^" 0 -. /«• 

1. Bo>nton. Bernice, Tl*' P > ■ n of growp 

("mtaroEmph'sl t'!"”’ TP- 

SUGOESTEP APPnlOHAl » * 

onSuSL f.e,eMrodia..e— roAStdot, 

Harris, Chester W.. pp. 922-926- e F. Lindqu»'' 

Mosicr. Charles i- “ 



Editor, Educational measurement, Washington, D. C., American Council
on Education, 1951.

Seashore, Harold G., Methods of expressing test scores, Test Service Bul-
letin No. 48, New York, Psychological Corp., 1955.

QUESTIONS FOR DISCUSSION

1. ... would be needed to interpret this score?

2. Why do standardized tests designed for use with high-school students
almost never use age or grade norms?

3. What limitations would national norms have for a county
school system in rural West Virginia? What might the local school sys-
tem do about them?

4. What assumption or assumptions lie back of the development of age
norms? Grade norms? Normalized standard scores?

5. In Fig. 6.8, p. 152, why are the standard scores evenly spaced, whereas
the percentile scores are unevenly spaced?

6. Using Tables 6.3 and 6.6, briefly characterize the following entering
sixth-grade children:

                            Reading    Study
             CA      MA     Score      Skills
Pupil A     12.4    10.6      23         13
Pupil B     10.5    13.2      31         19
Pupil C     11.3    11.1      22         16

7. You are a guidance counselor and have given the Differential Apti-
tude Battery to a ninth grade. Using Table 6.4, prepare a summary
and interpretation for a boy with the following scores:

Verbal Reasoning       18     Mechanical Reasoning       54
Numerical Ability      23     Clerical Speed and Acc.    45
Abstract Reasoning     31     Spelling                   14
Spatial Relations      72     Sentences                  22

8. School A gives a battery of achievement tests each May in each
grade from the third through the sixth. The median grade level in each
subject in each teacher's class is reported to the superintendent. Should
they be reported? If so, what else should be included in the report? In
what ways might a superintendent use the results to advantage? What uses
should he avoid?

9. Miss B prides herself that each year she has gotten at least 90 per
cent of her fifth-grade group “up to the norm” in each subject. How de-
sirable is this as an educational objective? What limitations or dangers do
you see in it?

10. School C operates on a policy of assigning transfer students to a
grade on the basis of their average grade standing on an achievement bat-
tery. Thus, a boy with a grade score of 6.4 on the battery as a whole
would be assigned to the sixth grade, no matter what his age or his grade



in his previous school. WhsI vnlocs do you see in Ihls plan? TVhat limi- 

lotions? ^ . r .-Kr^!. in citv D notcd that school E fell con- 

S' ll^sriu^rS^hTSSlSS' - - need 10 

^"?rmTa;d‘’’5’cd„cal,onincnyF^^^^^^^ 

grades in Ihcir comrnunily fell substaom y propose to 

Llie. though eonting up need? 

‘'“fl 'Sok nute ruallor some test, and study the information that .s 

given aboul this? ihe score to be ctpccted from blind 

b. Figure out the chance equivalent. VVhat hm.ta- 

guessing) for . ,t,- 

fions does this “"TnAeosJSness of the test at the upper 

c What limitations arc there on in 

d. '„"l”'m‘;’nr^’TCore points . ptohle 

to plot it and use the results. 



Chapter 7 


Qualities Desired in Any 
Measurement Procedure 


Whenever a worker in psychology or education desires to measure 
some quality in a group or individual, he faces the problem of choos- 
ing the best instrument for his purpose. Ordinarily there will be sev- 
eral tests or testing procedures that have been developed for, or that 
seem to be at least possibilities for, his purpose. He must choose 
among these. He is also probably interested in determining not only 
which is the best procedure but how well it satisfies his needs by some 
absolute standard. On what grounds can he make his choice or his 
appraisal? 

There are many specific considerations entering into the evaluation
of a test, but we shall consider them here under three main headings.
These are respectively validity, reliability, and practicality. Validity
refers to the extent to which a test measures what we actually wish to
measure. Reliability has to do with accuracy and precision of a meas-
urement procedure. Indices of reliability give an indication of the
extent to which a particular measurement is consistent and reproduc-
ible. Practicality is concerned with a wide range of factors of econ-
omy, convenience, and interpretability that determine whether a test
is practical for widespread use. These three aspects of test evaluation
will be considered in detail in the following sections.

VALIDITY 

The first and foremost question to be asked with respect to any
testing procedure is: How valid is it? When we ask this question, we
are inquiring whether the test measures what we want it to measure,
all of what we want it to measure, and nothing but what we want it
to measure.

When we apply a steel tape measure to the top of our desk to de-
termine its length, we have no doubt that the tape does in fact meas-
ure the length of the desk and does directly serve our purpose, which
may be to determine whether the desk will fit between two windows
in our room. Long experience with this type of measuring instrument
has confirmed beyond a shadow of doubt its validity as a tool for meas-
uring length.

Suppose now that we give to a group of children a test of reading
achievement. This test requires the children to select certain answers
to a series of questions about reading passages and to make little pencil
marks on an answer sheet. We count the number of pencil marks
made in the predetermined right places and give the child as a score
the number of his right answers. We call this score his reading com-
prehension. But the score itself is not the comprehension. It is the
record of a sample of behavior. Any judgment regarding compre-
hension is an inference from this number which is the number of al-
legedly correct answers. Its validity is not self-evident but is some-
thing we must establish on the basis of adequate evidence.

Consider again the typical personality inventory that endeavors to
provide an appraisal of “emotional adjustment.” In this type of in-
ventory the respondent marks a series of statements as being charac-
teristic of him or not characteristic of him. On the basis of various
types of procedures, which we shall consider in some detail in Chap-
ter 12, certain responses are keyed as indicative of emotional malad-
justment. A score is obtained by seeing how many of these responses
an individual selects. But making certain marks on a piece of paper
is a number of steps removed from actually exhibiting emotional dis-
turbance. We must find some way of establishing the extent to which
the performance on the test actually corresponds to the quality of be-
havior in which we are directly interested. How can we determine
the validity of such a measurement procedure?


TYPES OF EVIDENCE OF VALIDITY

A test may be thought of as corresponding to some aspect of human 
behavior in any one of three senses. For these three senses we shall 
use the terms (1) represent, (2) predict, and (3) signify. Let us ex-
plore each of these three, so that we may understand clearly what is 
involved in each case, and for what kinds of tests each of the three 
is relevant. 


VALIDITY AS REPRESENTING

Consider a test that has been prepared to measure achievement in
using the English language. How can we tell how well the test does in
fact measure that achievement? First, we must reach some agreement
as to the skills, knowledge and understanding that comprise correct
and effective use of English, and that have been the objectives of lan-
guage instruction. Then we must examine the test to see what skills,
knowledge and understanding it calls for. Finally, we must match the
analysis of test content against the analysis of course content and in-
structional objectives and see how well the former represents the lat-
ter. In proportion as the outcomes that we have accepted as goals
for the course are represented in the test, the test is valid.

Since the analysis is essentially a rational and judgmental one, this
is sometimes spoken of as rational or logical validity. Since the anal-
ysis is largely in terms of the content of the test, the term content
validity is also sometimes used. However, we should not think of
content too narrowly, because we may be interested in process as
much as in simple content. Thus, in the field of English expression
we might be concerned on the one hand with such “content” elements
as the rules and principles for capitalization, use of commas, or spell-
ing words with “ei” and “ie” combinations. But we might also be
interested in such “process” skills as arranging ideas in a logical order,
writing sentences that present a single unified thought, or picking the
most appropriate word to convey the desired meaning. In a sense,
content is what the pupil works with; process is what he does with it.
The problem of appraising content validity is closely parallel to the
problem of preparing the blueprint for a test, as discussed in Chapter
3, and then building a test to match the blueprint. A teacher’s own
test has content validity to the extent that a wise and thoughtful anal-
ysis of course objectives has been made in the blueprint, and care,
skill and ingenuity have been exercised in building test items to match
the blueprint. A standardized test may be shown to have validity for
a particular school or a particular curriculum insofar as the tasks that
it presents to the examinee correspond to and represent the objectives
accepted in that school or that curriculum.
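The bookkeeping side of matching a test against its blueprint can be sketched briefly, though the judgment of whether an item truly represents an objective remains a rational one that no calculation replaces. In the sketch below, the blueprint gives the intended share of the test for each objective and the item counts give what the test actually contains; the objectives, weights, counts, and the function name are all hypothetical.

```python
def representation_gaps(blueprint: dict, item_counts: dict) -> dict:
    """Return, for each objective, the actual share of items minus the blueprint share."""
    total_items = sum(item_counts.values())
    if total_items == 0:
        raise ValueError("the test must contain at least one item")
    gaps = {}
    for objective, intended_share in blueprint.items():
        actual_share = item_counts.get(objective, 0) / total_items
        gaps[objective] = round(actual_share - intended_share, 3)
    return gaps


if __name__ == "__main__":
    # Hypothetical blueprint for a test of English expression.
    blueprint = {"capitalization": 0.2, "punctuation": 0.3, "organizing ideas": 0.5}
    item_counts = {"capitalization": 20, "punctuation": 20, "organizing ideas": 10}
    print(representation_gaps(blueprint, item_counts))
```

A positive gap marks an objective that the test overrepresents and a negative gap one that it slights; in the illustrative figures the process skill of organizing ideas receives far less weight than the blueprint intended.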

It should be clear that validity evidenced as representing, i.e., ra-
tional or content validity, is important primarily for measures of
achievement. When we wish to appraise a test of reading compre-
hension, of biology, or of American history, we can really do so only
by asking: How well do the tasks of this test represent what we con-
sider to be important outcomes in this area of instruction? How well
do these tasks represent what the best and most expert judgment would
consider to be important knowledge and skills? If the correspondence
is good, we consider the test valid; if poor, the validity must be deemed
to be low.

The responsible maker of a test for publication and widespread use
goes to considerable pains to determine the widely accepted goals of
instruction in the field in which his test is to be built. [...]
appearing in yearbooks of one or another [...] course, (5) specialists
[...] the blueprint for his test, and [...] test items. Because of
variations from community to community, no published test can be made
to fit exactly [...] objectives of every local course of study. [...]
national basis is always less valid for a particular community than an
equally workmanlike test tailored [...] situation. However, the
well-made [...] common components that appear repeatedly in [...] and
courses of study and builds a test around them. [...] the common core
that is central in the different [...].

It should be clear from what has been said that the relationship
between teaching and testing is [...]. [...] content is drawn from what
has been taught, [...] proposed to be taught. [...] the instructional
program is the [...] materials.

Sometimes the thinking in a test [...] the thinking underlying a
local course of study, as when a group of specialists have been brought
together to design a test corresponding to [...] emerging trend in
education. Sometimes the test may lag behind, as when the test is
based on the relatively conventional objectives [...] emphasized in
[...].

VALIDITY AS PREDICTING

Frequently we are [...] specific [...] represented [...] an employment
test [...] machine operators [...] successful employee [...] high
production with little [...] personnel turnover [...].



[...]

Our evaluation of a test as predicting is an empirical and
statistical evaluation, and this aspect of validity has [...] been
spoken of as empirical or statistical validity. The basic procedure is
to give the test to a group who are entering some job or training
program, to follow them up later and get for each one some [...]
measure of success on the job or in the training program, and then
to compute the correlation between test score and the measure
of success. The higher the correlation, the more effective the test is
as a predictor.
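
The basic procedure just described can be sketched in a few lines. The
example below is only an illustration: the function name `pearson_r`
and both lists of scores are hypothetical, invented for the sketch, and
the product-moment correlation is assumed to be the coefficient
intended.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: test scores at entry, and a later rating of success
# on the job for the same eight people (same order in both lists).
test_scores = [42, 55, 61, 48, 70, 66, 53, 59]
success_ratings = [3.1, 3.8, 4.0, 2.9, 4.6, 4.1, 3.5, 3.6]

validity = pearson_r(test_scores, success_ratings)
print(f"validity coefficient: {validity:.2f}")
```

With real data the two lists would come from the follow-up study; the
made-up numbers here yield a far higher coefficient than is usually
found in practice.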

This relationship can also be pictured in various ways. For
example, the bar chart in Fig. 7.1 shows the percentage of [...]
pilot training at each of nine score levels on a [...].
Examination of the chart shows a steady increase in percentage
failing training as we go from the high to the low scores. The
relationship pictured in this chart corresponds to a correlation
coefficient [...].

The Problem of the Criterion. We said above that predictive validity
can be estimated by determining the correlation between test scores
and a suitable criterion measure of success on the job. The joker
here is the phrase “suitable criterion measure.” One of the most
difficult problems that the personnel psychologist or educator faces is
that of locating or creating a satisfactory measure of job success to
serve as a criterion measure for test validation. It may appear to the
student that it should be a simple matter to decide upon some measure
of rate of production or some type of rating by superiors. It may also
seem that this measure, once decided upon, should be obtainable in an
easy and straightforward fashion. Unfortunately, this is not so.
Finding or developing acceptable criterion measures usually involves the
research worker in the field of tests and measurements in a number
of troublesome problems.

Difficulties in obtaining satisfactory criterion measures arise from
a variety of sources. There are many types of jobs, such as those of
physician, teacher, secretary, or stock clerk, that yield no objective


* This is not entirely true. What a test “looks like” may be of importance
in determining its acceptability and reasonableness to those who will be
tested. Thus, a group of would-be pilots may be more ready to accept an
arithmetic test dealing with wind drift and gasoline consumption than they
would the same essential problems phrased in terms of costs of crops or of
recipes for baking cakes. This appearance of reasonableness is sometimes
spoken of as face validity.





record of performance or production. But even when such records
are available, they are often influenced by a variety of factors outside
the worker’s control. Thus, the production record of a weaver may
depend not only upon his own skill in threading or adjusting the loom
but also on the condition of the equipment, the adequacy of the
lighting where he must work, or the color of the thread he must weave.
The sales of an insurance agent are not only a function of his own
effectiveness as a salesman but also of the territory in which he must
work and the supervision and assistance he receives. The problems
of effective rating of personnel are discussed in detail in Chapter 13.
It suffices to indicate here that ratings are often unstable and influenced
by many factors other than the proficiency of the person being rated.

There are always many criterion measures that might be obtained
and used for validating a selection test. In addition to quantitative
performance records and subjective ratings, which have already been
mentioned, we might use later tests of proficiency. This is the type
of situation that is involved [...] is validated in terms of its
ability [...]. Here the [...] validated against course grades in
engineering [...].

All criterion measures are only partial in that [...] a part of
success on the job or only the preliminaries to actual job performance.
This last is true of the engineering school [...] above. They
represent a relatively immediate but [...] criterion of success as an
engineer. The ultimate criterion is some appraisal of the man’s
lifetime success in his profession. [...] nature of things, such an
ultimate [...] we must be satisfied with substitutes for it. These
[...] are only partial and are never completely satisfactory. Our
problem is always to choose the most satisfactory from among the
measures that it appears feasible to obtain. We are faced, then, with
the problem of deciding which of several criterion measures is most
satisfactory. How shall we arrive at this decision?

Qualities Desired in a Criterion Measure. There are four qualities
that we desire in a criterion measure. In order of their importance
they are (1) relevance, (2) freedom from bias, (3) reliability, and
(4) availability.

We judge a criterion to be relevant to the extent that score on the
criterion measure is determined by the same factors that determine
success on the job. In appraising the relevance of a criterion, we are
thrown back once more upon rational considerations. There is no
empirical evidence that will tell us whether a particular criterion
measure is or is not relevant. For achievement tests we found it
necessary to rely upon the best available professional judgment to
determine whether the content of the test accurately represented our
objectives. In the same way, with respect to a criterion measure it is
also necessary to rely upon professional judgment to provide the
appraisal of the degree to which any available partial criterion
measure is relevant to the ultimate criterion of job success.

A second factor important in a criterion measure is that of freedom
from bias. By this we mean that the measure should provide each
person with the same opportunity to make a good score. Examples
of biasing factors are such things as variation in wealth from one
district to another in our previous example of the insurance salesman,
variation in the quality of equipment and conditions of work of a
factory worker, variation in generosity of the bosses rating private
secretaries, or variation in the skill of teachers instructing pupils in
different classes. We can see that it will be difficult to get meaning
from the relationship of test results to a criterion score that largely
depends upon factors in the conditions of work rather than
in the individual worker.

The topic of reliability will be discussed in general terms later in
[...] the person who shows high job performance [...] possibility of
finding a test [...].

[...] practical problems of [...]. [...] going to take to get [...]
each individual? How much is it going to cost? Though a personnel
[...] afford to spend a substantial part of [...] criterion data, there
is always a [...] of a criterion measure must take this practical limit
into account.

THE INTERPRETATION OF VALIDITY

Suppose that we have gathered test and criterion scores for a group
of individuals and computed the correlation [...]. Perhaps our
predictor is a scholastic aptitude test, and our criterion is an
average of college freshman grades. How shall we now decide whether
[...] for a particular criterion is [...] available to us.

Some representative validity coefficients are exhibited in Table 7.1.
These give some picture of the size of correlation [...]. The
investigator concerned [...] course of study or [...] validities found
[...] for his particular criterion.



Table 7.1. Validity of Selected Tests as Predictors of Certain
Educational and Vocational Criteria

                                                                  Validity
Predictor Test                 Criterion Variable                Coefficient

Pintner General Ability Test   Metropolitan Achievement,             .76
                                 Reading Comp. (Gr. 5)
                               Metropolitan Achievement,             .84
                                 Total Score (Gr. 5)
ACE Psychological Exam,        College Grades, English               .48
  L Score                      College Grades, Math                  .33
                               College Grades, Art                   .24
Seashore Tonal Memory Test     Performance test on stringed          .28
                                 instrument
Short Employment Test
  Word Knowledge Score         Production index, 80 bookkeeping      .10
                                 machine operators
  Word Knowledge Score         Job grade, 106 stenographers          .53
  Arithmetic Skill Score       Production index, 80 bookkeeping      .26
                                 machine operators
  Arithmetic Skill Score       Job grade, 106 stenographers          .60
Differential Aptitude Tests
  Verbal Reasoning             English grades 3½ years later         .57
  Space Relations              English grades 3½ years later         .01
  Mechanical Reasoning         English grades 3½ years later         .17

The usefulness of a test as a predictor depends not only on how
well it correlates with a criterion, but also on how much new
information it gives. Thus, the Differential Aptitude Tests’ Verbal
Reasoning Test correlates on the average .48 with high-school English
grades, and a test of sentence usage correlates .51 with the same
grades. But the two tests have an intercorrelation of .62. They overlap
and, in part at least, the information each test provides is the same
as that provided by the other test. The net result is that pooling the
two tests can give a validity coefficient of no more than .55. If the
two tests were uncorrelated, each giving evidence completely independent
of the other, the combination of the two tests would give a validity
coefficient of .70.*

* Statistical procedures have been developed that enable us to determine the
best weighting to give the two or more predictors and to calculate the
correlation that will result from this combination. The procedures for
computing the weights for the separate components (called regression weights)
and the correlation (multiple correlation) resulting from them are beyond the
scope of this discussion but will be found in standard statistics texts.
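
The figures of .55 and .70 quoted above can be checked with the
standard two-predictor formula for the multiple correlation. The sketch
below assumes that this textbook formula is what lies behind the
figures; the function name `multiple_r` is an illustrative choice and
not something given in the text.

```python
from math import sqrt


def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two optimally weighted predictors.

    r1, r2 -- validity of each predictor taken alone
    r12    -- correlation between the two predictors (must not be +1 or -1)
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)


# Verbal Reasoning (.48) and sentence usage (.51), intercorrelated .62.
overlapping = multiple_r(0.48, 0.51, 0.62)

# The same two validities if the tests were completely uncorrelated.
independent = multiple_r(0.48, 0.51, 0.0)

print(f"overlapping tests:  {overlapping:.2f}")
print(f"independent tests:  {independent:.2f}")
```

The overlap between the two tests accounts for the whole difference
between the two results: the validities of the separate tests are the
same in both calls.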



Clearly, the higher the correlation between a test or other predictor
and a criterion, the more pleased we shall be. But in [...]
relative standard, we should like some absolute one. How high must
the validity coefficient be for the test to be [...] “satisfactory”
validity? This is a little bit like asking: How high is up?
However, we can try to give some sort of answer.

To an organization using a test as a basis [...] hire a particular
job applicant or admit [...], the significant question is: How [...]
decision on whom to hire or admit [...] if we operate on a purely
chance basis or [...] measure? The answer to this question [...]
considerable measure on the proportion of individuals [...]. A
selection procedure can do much more [...] individual who appears to be
the best [...] applicants than if we must accept nine out of [...].
[...] specific example, let us assume that [...] applicants. We may
then ask what per cent [...] will fall in the upper half of the whole
group in job success [...] per cent of correct [...]. [...] results for
correlations of different sizes are shown in Table 7.2.

Table 7.2. Per Cent of Correct Choices [...] 50 Per Cent of Group [...]

    Validity of              Per Cent of
Selection Procedure        Correct Choices

       .00                      50.0
       .20                      56.4
       .40                      63.1
       .50                      66.7
       .60                      70.5
       .70                      74.7
       .80                      79.5
       .90                      85.6
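
The text does not say how the entries of Table 7.2 were computed. One
set of assumptions that reproduces every entry is sketched here:
predictor and criterion are taken to follow a bivariate normal
distribution, the upper half on the predictor is selected, and a choice
counts as correct when the person also falls in the upper half on the
criterion. Under those assumptions the proportion correct is
1/2 + arcsin(r)/pi. The function name is hypothetical.

```python
from math import asin, pi


def pct_correct_choices(r):
    """Per cent of those in the upper half on the predictor who are also
    in the upper half on the criterion, assuming bivariate normality."""
    return 100 * (0.5 + asin(r) / pi)


for r in (0.00, 0.20, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90):
    print(f"{r:.2f}  {pct_correct_choices(r):5.1f}")
```

The same function shows why gains are hard won at the low end of the
scale: moving from .00 to .20 adds only about six points to the
percentage of correct choices.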


Table 7.2 indicates that [...] chance [...]. Fifty per cent of [...]
are defined as successful [...] upper half of the total group [...]
employees by just [...] 50 per cent of the time. The improvement in our
[...] as the correlation goes up is shown [...]. Thus, for a
correlation of .40 we will pick right 63.1 per cent of the time; with a
correlation of .80 our percentage will be 79.5, and so on.

The table shows not only our accuracy for any given correlation [...]
by one with a validity of .60, [...]. These percentages refer, of
course, to the ground rules set in the previous paragraph. However,
Table 7.2 gives a fairly representative basis for understanding the
effects of a selection program from the point of view of the employer
[...].

In many selection situations, the gain can be crudely translated into
a dollars-and-cents saving. Thus, if it costs a company $500 to
employ and train a new worker up to the point of [...],
a selection procedure that raised the per cent of successes from 56.4
to 63.1 would yield a saving in wasted training expenses of
$3350 per 100 men tested. This takes no account of the possibility
that the test-selected men might also be better workers after they had
completed their training.
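
The dollars-and-cents figure follows from simple arithmetic, shown here
as a short sketch. The function and parameter names are illustrative
only; the numbers are those of the example in the text.

```python
def training_savings(pct_before, pct_after, cost_per_trainee, group_size=100):
    """Training expense no longer wasted when the per cent of successes rises."""
    extra_successes = (pct_after - pct_before) / 100 * group_size
    return extra_successes * cost_per_trainee


# Successes rise from 56.4 to 63.1 per cent; each trainee costs $500.
savings = training_savings(56.4, 63.1, 500)
print(f"saving per 100 men: ${savings:,.0f}")
```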

Another way of appraising the practical significance of a correlation
coefficient, and one that is perhaps more meaningful from the point
of view of the person being tested, is shown in Table 7.3. The rows
in the little tables represent the fourths of a group of applicants,
potential students or employees, with respect to a predictor test. The
columns indicate the per cent of cases falling in each fourth on the
criterion score. Look at the little table in Table 7.3 corresponding to
a validity coefficient of .50. We see that of those who fall in the
lowest fourth on our predictor 480 out of 1000 or 48.0 per cent fall
in the lowest fourth on the criterion score, 27.9 per cent in the next to
lowest fourth, 16.8 per cent in the next to highest fourth, and 7.3 per
cent in the highest fourth. The diagonal entries represent cases that
fall in the same fourth on both predictor and criterion. The further
we get from the diagonal, the greater the discrepancy between
prediction and outcome.
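
Entries of this kind can be approximated numerically. The sketch below
assumes, and the text does not state this, that predictor and criterion
follow a bivariate normal distribution with correlation r. The function
name and the simple midpoint integration are illustrative choices. Rows
and columns run from the lowest fourth (index 0) to the highest.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def quartile_table(r, steps=2000):
    """Cases per 1000 in each criterion fourth, for each predictor fourth.

    Assumes bivariate normal scores with correlation r, where -1 < r < 1.
    """
    # Quartile cut points; -8 and 8 stand in for minus and plus infinity.
    cuts = [-8.0, STD.inv_cdf(0.25), 0.0, STD.inv_cdf(0.75), 8.0]
    spread = (1.0 - r * r) ** 0.5  # SD of criterion for a fixed predictor score
    table = []
    for i in range(4):
        low, high = cuts[i], cuts[i + 1]
        width = (high - low) / steps
        row = [0.0] * 4
        for k in range(steps):
            x = low + (k + 0.5) * width      # midpoint of a thin slice
            weight = STD.pdf(x) * width      # share of all cases in the slice
            for j in range(4):
                upper = STD.cdf((cuts[j + 1] - r * x) / spread)
                lower = STD.cdf((cuts[j] - r * x) / spread)
                row[j] += weight * (upper - lower)
        # Each predictor fourth holds a quarter of all cases, so multiply
        # the joint proportions by 4000 to express them per 1000 in the row.
        table.append([round(4000 * p) for p in row])
    return table


for row in quartile_table(0.50):
    print(row)
```

Entries obtained this way may differ from the printed figures by a unit
or so in the last place because of rounding.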

This table emphasizes not so much the gain from using the
test as the variation in job success of those who are similar in predictor
scores. From the point of view of schools or employers, the important
thing is the improved percentage of accuracy illustrated in Table
7.2. Dealing in large numbers, they can count on gaining from any
predictor that is more valid than the procedure currently in use. From
the point of view of the single individual, the many marked
discrepancies between predicted and actual success shown in Table 7.3 may



Table 7.3. Accuracy of Prediction for Different Values of the
Correlation Coefficient
(1000 cases in each row or column)

r = .00
Quarter on        Quarter on Criterion
Predictor       4th    3rd    2nd    1st
   1st          250    250    250    250
   2nd          250    250    250    250
   3rd          250    250    250    250
   4th          250    250    250    250

r = .40
Quarter on        Quarter on Criterion
Predictor       4th    3rd    2nd    1st
   1st          104    191    277    428
   2nd          191    255    277    277
   3rd          277    277    255    191
   4th          428    277    191    104

r = .60
Quarter on        Quarter on Criterion
Predictor       4th    3rd    2nd    1st
   1st           45    141    277    537
   2nd          141    264    318    277
   3rd          277    318    264    141
   4th          537    277    141     45

r = .70
Quarter on        Quarter on Criterion
Predictor       4th    3rd    2nd    1st
   1st           22    107    270    601
   2nd          107    270    353    270
   3rd          270    353    270    107
   4th          601    270    107     22


seem at least as important. [...] applicant may be less than [...] by
the fact that [...] he will be below average [...] very well. He may
[...] exception.

One further point [...] emphasized in [...]. Validity is always
specific to a particular curriculum or [...]. When an author or
publisher claims that [...] test is valid, it is always appropriate to
ask: Valid for what? A [...] accurately [...] store sales clerks who
will be pleasant to [...] about their stock, and accurate in financial
transactions may be entirely useless in identifying effective insurance
salesmen [...] find or create new business. Validity must always be
evaluated in relation to the specific situation in which a measure is
to be used.

VALIDITY AS SIGNIFYING 

Sometimes we ask, with respect to a psychological test, neither
“How well does this test predict job success?” nor “How well does
this test represent our curriculum?”, but “What does this test mean
or signify?” What does the score tell us about an individual? Does
it correspond to some meaningful trait or construct that will help us
in understanding him? For this question of whether the test tells us
something meaningful about people the term construct validity has
been used.

Let us examine one specific testing procedure and see how its
validity as a measure of a useful psychological quality or construct was
studied. McClelland developed a testing procedure to appraise the
individual’s need or motivation to achieve, that is, to succeed and do
well. The test used pictures like those in the Thematic Apperception Test
(see Ch. 15). The individual was called upon to make up a story
about each picture, telling what was happening and how it turned out.
A scoring system was developed for these stories, based on counting
the frequency with which themes of accomplishment, mastery, success,
and achievement appeared in the story material. Thus, each individual
received a score representing the strength of his motivation to achieve.
Now, how are we to determine whether this measure has validity in
the sense of truthfully describing a meaningful aspect of the
individual’s make-up? Let us see how McClelland and his co-workers
proceeded.

In essence, the investigators proceeded to ask: “With what should
a measure of achievement motivation be related?” They made a
series of predictions. Some of the predictions were as follows:

1. Those high on achievement motivation should do well in college,
in relation to their scholastic aptitude.

2. Achievement motivation should be higher just after students
have been taking tests described to them as measuring their intelligence.

3. Those high on achievement motivation should complete more [...].

4. [...] should be higher for children of families
emphasizing early independence.

Each of these predictions [...] behavior.” Thus, [...] ability and
effort. Presumably [...] higher motivation to achieve [...].

[...] supported by the experimental [...] were related to a number of
[...] that was predicted from a rational analysis of the [...]
presumably measuring lent support to the validity [...] procedure as
measuring a meaningful [...] essential characteristics are well
summarized by the label “achievement motivation.”

A great many of our [...] lesser extent, some educational tests, [...]
general traits or qualities of the individual. Verbal reasoning,
spatial visualizing, sociability, introversion, [...] designations of
traits or constructs. Tests of these functions are [...] they behave
in the way that such a trait [...].

Some of the indicators of how a trait (and therefore a test of it)
should behave are:

1. Its correlations with other tests, [...] are already accepted
measures of the [...] by their correlations with [...].

2. [...] ability to differentiate [...] groups [...]. [...] score on
achievement need [...] is higher for those [...] education, those with
[...] incomes, those from [...] their 30’s rather than older [...] the
validity [...].

3. Its response to changes in experimental conditions [...] conditions
that are [...] responsiveness of the instrument. Thus, [...] one study
[...] proposed as an indicator of anxiety.



RETEST WITH THE SAME TEST

If we wish to find how reliably we can [...] weight, we [...] measures
taken independently by two [...] persons. We don’t [...] recollection
of the first [...] the two [...] variation or “error” is in the [...]
to the operation of measurement.

Sometimes we are interested in variation within the individual;
sometimes we are not. We may ask: How accurately does our
measurement characterize S at this moment of time? Or we may ask: How
accurately does our measure of S today describe him as he will be
tomorrow, or next week, or next month? Both are sensible questions.
But they are not the same question. The data we must gather to
answer one are different from the data we need to answer the other.
To study the reliability of such a physical characteristic of a person
as weight, repetition of the measurement is a straightforward and
satisfactory operation. It appears satisfactory and applicable also with
some simple types of behavior, such as speed of reaction or muscular
strength. But suppose now we are interested in the reliability of a
test of reading comprehension. Let us assume that the test is made
up of six reading passages with ten questions on each. We administer
the test once and then immediately administer it again. What
happens? Certainly, the child is not going to have to reread all the
material he has just read. He may do so in part, but to a considerable
extent his answers the second time will involve merely remembering
what answer he had chosen the time before and marking it again. If
he had not been able to finish the first time, he will now be able to
work ahead and spend most of his time on new material. These same
effects will hold true to some degree even over a longer period of
time. Clearly, this sort of test given a second time does not present
the same task that it did the first time.

There is a second consideration entering into the repetition of such
a test as a reading comprehension test. Suppose that one of the six
passages in the test was about baseball and that a particular boy was
an expert on baseball. The passage would then [...] him and he would
in effect get a bonus of several points. [...]

In such an area of ability as reading, [...] must recognize the
possibility that an individual does not [...] throughout the whole
area. His specific interests [...] and background give him strengths
and weaknesses. [...] sample from the whole area. [...] relative to
others, is likely to [...] the particular sample of tasks chosen to
represent the [...] of ability or personality we are trying to
appraise. [...] measurements, his behavior will stay more nearly [...]
sample of tasks is varied.

Note that so far we have [...] sources of variation in performance
that will tend to reduce the precision of a particular score as a
description of an individual:

1. Variation in response to the test at [...].

2. Variation in the individual [...].

3. Variation arising out of the particular sample of tasks chosen
to represent an area of behavior.

Retesting the individual with [...] arranged to reflect the first two
types of “error,” but this procedure cannot evaluate the effects of the
third type. [...] memory and practice effects to which we referred
above.

PARALLEL TEST FORMS

Concern about this third source of variation, variation arising
because of the necessity of choosing a particular sample of tasks to
represent a whole area of behavior, leads us to another set of procedures
for evaluating reliability. If [...] a source of “error,” and [...]
case, we want to [...] what accuracy we may [...] specific [...] of
behavior it is supposed to represent, we must [...] procedures that
take account of the [...] due to the sample of tasks. We may do this
by correlating two equivalent forms of a test.



Equivalent forms of a test should be [...] according to the same
specifications but composed of separate samples of behavior in the
defined area. Thus, two [...] tests should contain reading passages
and questions of [...] difficulty. The same sorts of questions should
be asked, i.e., the same balance of specific fact and general idea
questions. The same types of passages should be represented, i.e.,
expository, [...] esthetic. But the specific passages and questions
should be different.

If we have two forms of a test, we may give each pupil first one
form and then the other. They may follow each other immediately
if we are not interested in stability over time, or may be separated by
an interval if we are. The correlation between the two forms will
provide an appropriate reliability coefficient. If a time interval has
been allowed between the testings, all three of our sources of variation
will have had a chance to get in their effects: variation arising from
the measurement itself, variation in the individual over time, and
variation due to the sample of tasks.

To ask that a test yield consistent results under these conditions
is the most rigorous standard we can set for it. And if we
use our test results to generalize about what Johnny will do on other
tasks of this general sort next week and next month, then this is the
appropriate standard by which to evaluate a test. For most educational
situations, this is the way we want to use test results, and
evidence based on equivalent test forms should usually be given the
most weight in evaluating the reliability of a test.

The use of two parallel test forms provides a very sound basis for
estimating the precision of a psychological or educational test. The
procedure does, however, raise some practical problems. It demands
that two parallel forms of a test be available and that time be allowed
for administering two separate tests. Sometimes no second form of
a test exists, or no time can be found for a second testing. To
administer a second separate test is often likely to represent a somewhat
burdensome demand upon available resources. These practical
considerations of convenience and expediency have made test makers
receptive to procedures that extract an estimate of reliability from
administration of only one form of a test. However, such procedures
are compromises at best. The correlation between two parallel forms,
usually administered with a lapse of several days or weeks in between,
represents the preferred procedure for estimating reliability.

SUBDIVIDED TEST

The most widely used procedure for estimating reliability from a
single testing divides a particular test up into two presumably
equivalent halves. The half-tests may be assembled on the basis of careful
examination of the content and difficulty of each item, making a
systematic effort to balance out the content and difficulty level of the
two halves. A simpler procedure, which is often relied upon to give
equivalent halves, is to put alternate items into the two half-tests,
that is, to put all the odd-numbered items in one half-test and all the
even-numbered items in the other. This is [...] since items of similar
form, content, and difficulty are likely to be grouped together in a
test. For a reasonably long test, of [...] items or more, splitting the
test up in this way will tend to balance out factors of item form,
content covered, and difficulty level. The two half-tests will have a
good probability of constituting equivalent tests, as these are defined
in the [...].

The procedures we are discussing divide the test in half only
for scoring, not for administration. That is, the test is given at
a single sitting and with a single time limit. However, two separate
scores are derived, one by scoring the odd-numbered items and one by
scoring the even-numbered items. The correlation between these two
scores provides a measure of the accuracy with which the test is
measuring the individual.

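However, it must be noted that the computed correlation is between
two half-length tests. This value is not directly applicable to the
full-length test, which is the actual instrument prepared for use. In
general, the larger the sample of a person’s behavior we have, the
more reliable the measure will be. The more behavior we record, the
less our measure will depend upon chance elements in the behavior of the
individual or in the particular sampling of tasks. Single lucky answers
or momentary lapses of attention will be more nearly evened out.

Where the two halves of the test, which gave the scores actually
correlated, are equivalent, we can get an unbiased estimate of
total-test reliability from the correlation between the two half-tests.
The estimate is given by the formula

$$ r_{11} = \frac{2\, r_{\frac{1}{2}\frac{1}{2}}}{1 + r_{\frac{1}{2}\frac{1}{2}}} \qquad (1) $$

where $r_{11}$ is the estimated reliability of the full-length test and
$r_{\frac{1}{2}\frac{1}{2}}$ is the correlation between the two half-tests.
Thus, if the correlation between the half-tests is .60,
formula 1 would give

$$ r_{11} = \frac{2(.60)}{1 + .60} = \frac{1.20}{1.60} = .75 $$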
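
The odd-even procedure and formula 1 can be sketched together as
follows. This is a minimal illustration, assuming items scored 1 for
right and 0 for wrong; the function names and the small table of item
scores are hypothetical and serve only to show the steps.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def full_length_reliability(r_half):
    """Formula 1: step a half-test correlation up to full test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores holds one row per examinee, one 0/1 entry per item."""
    odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    return full_length_reliability(pearson_r(odd, even))


# The worked example from the text: half-tests correlating .60.
print(full_length_reliability(0.60))

# Hypothetical item scores for six examinees on a six-item test.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
print(round(split_half_reliability(scores), 3))
```

A six-item test is far too short for a dependable estimate; the small
table is used only so that the arithmetic can be followed by hand.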



This formula, referred to generally as the Spearman-Brown Prophecy
Formula from the names of its originators and its function, makes it
possible for us to compute an estimate of reliability from a single
administration of a single test.

The appealing convenience of the split-half procedure has led to
its wide use. Many test manuals will be found to report this type
of reliability coefficient and no other. Unfortunately, this procedure
has several types of limitations, which we must now consider.

In the first place, when we have extracted two scores from a single
testing, both scores necessarily represent the individual as he is at
the same moment of time. Even events lasting only a few minutes
will affect both scores about equally. In other words, variation in
the individual from day to day cannot be reflected in this type of
reliability coefficient. It can only give evidence as to the precision
with which we can appraise him at a specific moment in time.

In the second place, a split-half reliability coefficient becomes
meaningless when a test is highly speeded. Suppose we have a test of
simple arithmetic, made up of problems like 3 + 5 = ?, and that the
test is being used with adults with a 2-minute time limit. We will
get wide differences in score on such a test, but the differences will be
primarily differences in speed. Errors will be a minor factor. A
person who gets a score of 50 will very probably have attempted just
50 items, and of these 25 will be odd and 25 will be even. In other
words, the two halves of the test will appear perfectly consistent,
because opportunity to attempt items is automatically balanced out for
the two half-tests.

Few tests depend as completely upon speed as does the one that
we have chosen to illustrate our point. However, many involve some
degree of speeding. This speed factor will tend to inflate estimates
of reliability based on the split-half procedure. The amount of
overestimation will depend upon the degree to which the test is speeded,
being greater for those tests in which speed plays a greater role.
However, speed enters in sufficiently generally so that split-half
estimates of reliability should always be discounted. Test users should
demand that commercial publishers provide reliability estimates based on
parallel forms of the test.
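
The effect of speeding can be seen in a small numerical sketch. It
assumes the extreme case described above: every item attempted is
answered correctly, so that a person's score is simply the number of
items he reached. The numbers of items reached are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical number of items reached by six adults in the time limit.
items_reached = [28, 35, 41, 50, 57, 63]

# With no errors, the odd-item and even-item scores are fixed by the
# number reached alone and can never differ by more than one point.
odd_scores = [(n + 1) // 2 for n in items_reached]
even_scores = [n // 2 for n in items_reached]

r_half = pearson_r(odd_scores, even_scores)
print(f"odd-even correlation: {r_half:.3f}")
```

The near-perfect correlation says nothing about how consistently these
people would work on another day; it reflects only the fact that both
half-scores were taken from the same timed performance.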


RELIABILITY ESTIMATED FROM ITEM STATISTICS

The teacher or investigator who makes much use of tests and who
reads extensively in test manuals will encounter one other type of
procedure for estimating test reliability from a single test
administration. This procedure, also named for its originators, yields
what is referred to as a Kuder-Richardson reliability coefficient. The
essential assumption in the procedure is that the items within one form
of a test have as much in common with one another as do the items
in that one form with the corresponding items in a parallel or
equivalent [...] or personality as do the others. [...] Kuder-Richardson
procedure [...] estimate that has essentially the same interpretation
[...] coefficient we have [...] likewise (1) takes [...] the split-half
type of reliability.*
COMPARISON OF METHODS

A summary comparison of the different procedures for estimating
reliability is given in Table 7.4. ... that may make a single test score
... individual's usual performance. The table shows ... factors are
represented ... reliability we have discussed. ... their effects. Each
of the other

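* A widely used form of the Kuder-Richardson procedure (their Formula 20)
takes the form

$$ r_{11} = \left(\frac{n}{n-1}\right)\left(\frac{s_t^2 - \sum pq}{s_t^2}\right) $$

where $r_{11}$ is the estimate of reliability,
$n$ is the number of items in the test,
$s_t$ is the standard deviation of the test,
$\sum$ means take the sum, and covers the $n$ items,
$p$ is the proportion passing a particular item,
$q$ is the proportion failing the same item.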



Table 7.4. Sources of Variation Represented in Different Procedures for
Estimating Reliability

Sources of variation (how much the score can be expected to fluctuate
owing to):
    Variations arising within the measurement procedure itself
    Changes in the individual from day to day
    Changes in the specific sample of tasks
    Changes in the individual's speed of work

methods masks some source of variation that may be significant in the
actual use of tests. Retesting with the same test neglects
variation arising out of the sample of items. Whenever all the testing
is done at one point in time, variation of the individual from day to
day is neglected. When the testing is done as a unit with a single time
limit, variation in speed of responding is neglected. The facts brought
out in this table should be borne in mind in evaluating reliability data
found in a test manual or in the report of a research study.
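
For the reader who wishes to see how a Kuder-Richardson coefficient is obtained from item statistics, the following is a minimal sketch of Formula 20 in its commonly published form. The function name `kr20` and the five-person, four-item response matrix are invented for illustration. The variance of total scores is computed with N rather than N - 1 in the denominator so that it is on the same footing as the item variances pq.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for right/wrong (1/0) item scores.

    `responses` is a list of examinees, each a list of 0/1 item scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Variance of total scores (divided by N, to match the p*q terms).
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum over items of p*q, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in responses) / n_people
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (var_total - sum_pq) / var_total


# Hypothetical data: 5 examinees, 4 items of increasing difficulty.
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(example), 2))
```

Like the split-half coefficient, this estimate rests on a single administration, so it reflects neither day-to-day variation in the individual nor variation in speed of work.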

INTERPRETATION OF RELIABILITY DATA

Analysis of data obtained from a general intelligence test for
elementary-school children has yielded a reliability coefficient of .85.
How shall we interpret this result? What does it mean concerning
the precision of an individual's score? Should we be pleased or
dissatisfied to get a coefficient of this size?

We have already tried to give some content and meaning to
correlation coefficients in Fig. 5.7 and in Tables 5.8, 7.1, 7. ...
These have shown typical values of the correlation coefficient, the
scatter of scores for representative correlations, and the accuracy of
prediction with correlations of different sizes. A further contribution
to the interpretation of test reliability is found in the relationship
between the reliability coefficient and the standard error of measurement.
It will be remembered that the standard error of measurement is
an estimate of the standard deviation that would be obtained for a
series of measurements of the same individual. (It is assumed ...)

$$ s_m = s_t \sqrt{1 - r_{11}} \tag{2} $$

where $s_m$ is the standard error of measurement,
$s_t$ is the standard deviation of test scores,
$r_{11}$ is the reliability coefficient.

Suppose that our test has a reliability of .85 and a standard
deviation of 15 points. Then we have

$$ s_m = 15\sqrt{1 - .85} = 15\sqrt{.15} = 5.8 $$

In this instance, a set of repeated measures of one person would be
expected to have a standard deviation of 5.8 points. In a normal
distribution, a fairly uniform proportion of observations falls within any
given number of standard deviation units from the mean. Values for this
relationship were given in Table 5.6. In the normal curve, a measure
differs from the mean by as much as 1 standard deviation in 31.8 per cent
of cases, and by as much as 2 standard deviations in ... Applying this to
our case, in which the standard deviation of our measurements is 5.8
points, there is about 1 chance in 3 that a score obtained for an
individual differs from his "true" score by as much as 5.8 points
(1 standard error of measurement). There is 1 chance in 25 that it
differs by as much as 11.6 points (2 standard errors of measurement).

The values shown are representative of what might be found for
intelligence quotients from one of the commercially distributed group
intelligence tests in the upper elementary grades. Note that even with
this relatively high reliability coefficient, errors of 5 or 10 points
of IQ are possible in at least a minority of cases. Shifts of this size
will occur fairly frequently just because of errors of measurement.
Any ... interpret an IQ ...

... to the reliability ... The magnitude of errors decreases as the
reliability increases, but we



Table 7.5. Standard Error of Measurement for Different Values of
Reliability Coefficient

                     Standard Error of Measurement
Reliability
Coefficient      General Expression      When s_t* = 10
    .50               .71 s_t                 7.1
    .60               .63 s_t                 6.3
    .70               .55 s_t                 5.5
    .80               .45 s_t                 4.5
    .85               .39 s_t                 3.9
    .90               .32 s_t                 3.2
    .95               .22 s_t                 2.2
    .98               .14 s_t                 1.4

* s_t signifies the standard deviation of the test.


see that errors of appreciable size will still be found even with
reliability coefficients of .90 or .95. In interpreting the score of a particular
individual, it is the standard error of measurement that must be kept
in mind. If we think of a range extending from 2 standard errors of
measurement above the obtained score to 2 below, we will have a
band within which we can be reasonably sure (19 chances in 20) that
the individual's true score lies. Thus, in the case of the intelligence
test described in previous paragraphs, we can think of a test IQ of
90 as meaning rather surely an IQ lying between about 80 and 100.
If we think in those terms, we shall be much more discreet in
interpreting and using test results.
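
The computation of the standard error of measurement and of the band of 2 standard errors on either side of an obtained score can be sketched as follows. The function names are ours, and the figures are those of the illustration above (reliability .85, standard deviation 15, obtained IQ of 90). Carried to two decimals, the standard error is about 5.81 points.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def score_band(obtained, sd, reliability, n_errors=2.0):
    """Range extending n_errors standard errors above and below a score."""
    sem = standard_error_of_measurement(sd, reliability)
    return obtained - n_errors * sem, obtained + n_errors * sem


sem = standard_error_of_measurement(15, 0.85)
low, high = score_band(90, 15, 0.85)
print(round(sem, 2))                 # about 5.81
print(round(low, 1), round(high, 1))  # roughly 78 to 102
```

Stated in round numbers, the band runs from about 80 to about 100, which is the interpretation suggested for a test IQ of 90.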

When interpreting the test score of an individual, it is desirable
to think in terms of the standard error of measurement and to be
somewhat humble and tentative in drawing conclusions from that
test score. But for making comparisons between tests and for a
number of types of test analysis, the reliability coefficient will be more
useful. Where measures are expressed in different units, as height in
inches and weight in pounds, the reliability coefficient provides the
only possible basis for comparison. Since the competing tests in a
given field, such as primary reading, are likely to use types of scores
that are not really comparable, the reliability coefficient will usually
represent the only satisfactory basis for test comparison. Other things
being equal, we shall prefer the test with the higher reliability
coefficient, that is, the test that provides a more consistent ranking of the
individual within his group.




The other things that may not be equal are primarily matters
of validity and practicality. Validity, insofar as we can appraise it,
is the crucial test of a measurement procedure. Reliability is important
only as a necessary condition for a measure to have validity. The
ceiling for the possible validity of a test is set by its reliability. A
test must measure something before it can measure what we want
to measure. A measuring device with a reliability of .00 is reflecting
nothing but chance factors. It does not correlate with itself and cannot
correlate with anything else. The theoretical ceiling for the validity
coefficient of a test (i.e., its correlation with some criterion
representing success in ... job) is the square root of
its reliability coefficient. Thus, a test with a reliability ...
could not give a validity coefficient ..., and one with a reliability
coefficient of .64 ... yield a ... a test measures something accurately
... of intelligence. Validity ... sometimes lead ... the 3 hours ...
designed to serve. ... paragraphs, we ... the limitations ...

1. Range of ... consistently a test places each individual relative to
... shifting from test to ... But the ...




If children from several different grades are pooled together, we
may expect a much higher reliability coefficient. For example, the
manual for the Otis Quick-Scoring Mental Ability Test, Beta, reports
alternate-forms reliabilities for single grade groups ranging from .65
to .87. The average value is .78. But pooling the complete range
of grades (4-9), the reliability coefficient is reported as .96. These
data are all for the same test. They reflect the same precision of
measurement, yet the coefficient for the combined groups is strikingly
higher. Similar data are reported for the Durrell-Sullivan Reading
Achievement Test. The data in this case involve a range of four grades
from grade three through grade six. Reliability coefficients are
split-half reliabilities based on a single testing. In the case of the
Word Meaning Test, the average coefficient for a single grade is .93,
whereas the correlation for all four grades together is .97. For the
test of Paragraph Meaning the corresponding values are .87 and .94.

In evaluating a reported reliability coefficient, the range of talent
in the group tested must be taken into account. If the reliability
coefficient is based upon a combination of age or grade groups, it must
usually be sharply discounted, as can be seen above. But even in
less extreme cases, account must be taken of the variability of talent
within the group. Reliabilities for age groups will tend to be
somewhat higher than for grade groups, because an age group will usually
contain a greater spread of talent than a single grade. A sample
made up of children from a wide range of socio-economic levels will
tend to yield higher reliabilities than a very homogeneous one. In
comparing different tests, one must take account of the type of
sample on which the reliability data were based, in so far as this can be
determined from the reported facts, and judge more severely the test
whose reliability is based on the more heterogeneous group.
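
Why pooling grades raises the coefficient can be seen from a simple sketch. It assumes, as an idealization not stated in the manuals cited, that the errors of measurement have the same variance in the narrow and in the wide group, so that only the spread of scores changes. The function name and the two standard deviations (10 and 23.5) are hypothetical, chosen so that a single-grade reliability of .78 rises to about .96.

```python
def reliability_in_wider_group(r_narrow, sd_narrow, sd_wide):
    """Reliability expected in a group with a different spread of scores.

    Assumption: error variance is the same in both groups, so
    reliability = 1 - error variance / total variance.
    """
    error_variance = (1.0 - r_narrow) * sd_narrow ** 2
    return 1.0 - error_variance / sd_wide ** 2


# Hypothetical spreads: one grade (SD 10) versus six grades pooled (SD 23.5).
pooled = reliability_in_wider_group(0.78, 10.0, 23.5)
print(round(pooled, 2))
```

The precision of the individual score is unchanged in this sketch. The coefficient rises only because the same errors are being compared with a larger spread of talent.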

2. Level of Ability in the Group. Precision of measurement of a
test may be related to the ability level of the persons being measured.
However, no simple rule can be formulated for stating the nature of
this relationship. It depends upon the way in which the particular 
test was built. For those people for whom the test is very hard, so 
that they are doing a large amount of guessing, accuracy is likely to 
be low. At the other extreme, if a test is very easy for a group, so 
that all of them can do most of the items very easily, it may be ex- 
pected to be ineffective in discriminating among the members of the 
group. When everyone can do the easy items, it is as if we had short- 
ened the test to just the few harder items that some can do and some 
cannot. 

It is possible, also, that a test may vary in accuracy at different
intermediate difficulty levels. The meticulous test constructor will
report the standard error of measurement for his test at different score
levels. When separate values of the standard error of measurement
are reported in the manual, they provide a basis for evaluating the
precision of the test for different types of groups. They permit a more
appropriate estimate of the accuracy of a particular individual's score.
Each individual's score can be interpreted in relation to the standard
error of measurement for scores of that level. For example, from
data provided by Terman and Merrill the standard error of measurement
for the 1937 edition of the Stanford-Binet for different IQ levels
is found to be as follows:

IQ Level            Standard Error of IQ
130 and over
110-129
90-109
70-89
Below 70                    2.2


For this test, the variation that may be expected from one testing to
another is very much higher for children with average and above
average IQ's than for the retarded child. In the case of the Wechsler
Intelligence Scale for Children, the standard error of measurement
depends upon the age of the group tested. The manual reports values
as follows:

 7½-year-olds    4.2 points of IQ
10½-year-olds    3.4 points of IQ
13½-year-olds    3.7 points of IQ

The test is most accurate for an age group in the middle of the age
range for which it was intended.

3. Length of Test. As we saw on p. 179 in discussing the split-half
reliability coefficient, test reliability depends on the length of the test.
If we can assume that the quality of the test items and the nature
of the examinees remain the same, then the relationship of reliability
to length can be expressed by a simple formula. The formula is

$$ r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}} \tag{3} $$

where $r_{nn}$ is the reliability of a test n times as long as the original test,
$r_{11}$ is the reliability of the original test,
$n$ is, as indicated, the factor by which the length of the test is
increased.

This is a more general form of formula 1 found on p. 179.




Suppose we have a spelling test made up of 20 items which has a
reliability of .50. We want to know how reliable the test will be if
it is lengthened to contain 100 items comparable to the original 20.
The answer is

$$ r_{nn} = \frac{5(.50)}{1 + 4(.50)} = \frac{2.50}{3.00} = .83 $$

As the length of the test is increased, the chance errors of
measurement more or less cancel out; score comes to depend more and more
completely upon the characteristics of the person being measured;
and a more accurate appraisal of him is obtained.

Of course, how much we can lengthen a test is limited by a number 
of practical considerations. It is limited by the amount of time avail- 
able for testing. It is limited by factors of fatigue and boredom on 
the part of examinees. It is sometimes limited by the stock of good 
test items that it is possible to construct. But within these limits, 
reliability can be increased as needed by lengthening the test. 

One special type of lengthening is represented by increasing the
number of raters who rate an individual or a product he has produced.
If several raters of equal competence or equal familiarity with the 
ratee are available, a pooling of their ratings will produce increased 
reliability in the composite rating, and this increase will be described 
by the same formula we have just been considering. 
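
The formula relating reliability to length is easily put to work, both for lengthened tests and for pooled ratings. In the sketch below the function names are ours. The second function simply solves the same formula for n, and the single-rater reliability of .45 used in the last line is a hypothetical figure.

```python
def lengthened_reliability(r_original, n):
    """Reliability of a test n times as long as the original."""
    return n * r_original / (1.0 + (n - 1.0) * r_original)


def length_factor_needed(r_original, r_target):
    """Factor n by which length must grow to reach r_target.

    Obtained by solving the lengthening formula for n.
    """
    return r_target * (1.0 - r_original) / (r_original * (1.0 - r_target))


# The spelling test: 20 items with reliability .50, lengthened to 100 items.
print(round(lengthened_reliability(0.50, 5), 2))

# How much longer must the same test be to reach a reliability of .90?
print(round(length_factor_needed(0.50, 0.90), 1))

# Pooling the ratings of 4 raters, each with a hypothetical reliability of .45.
print(round(lengthened_reliability(0.45, 4), 2))
```

Both uses rest on the assumption stated in the text: the added items, or the added raters, must be comparable in quality to those already at hand.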

4. Operations Used for Estimating. How high a value will be ob- 
tained for the reliability coefficient depends also upon which of the 
several possible sets of experimental operations is used to estimate 
the reliability. We saw in Table 7.4 that the different procedures treat 
different sources of variation in different ways, and that it is only the 
use of parallel forms of a test with a period intervening that includes 
all four sources of variation in “error.” That is, this procedure of 
estimating reliability represents a more exacting definition of the test’s 
ability to reproduce the same score. The individual must then show 
consistency both from one sample of tasks to another and from one 
day to another. We have gathered together a few examples that show 
reliability coefficients for the same test when these were computed by 
two different procedures. These are shown in Table 7.6. 

The two procedures compared in Table 7.6 are correlation of
alternate forms and correlation of half-tests made up from a single form.
It will be noted that the alternate-forms correlation is lower in every
case. This is consistent with our earlier discussion, in which we
pointed out that the alternate-forms procedure constitutes a more
demanding test of an instrument's precision. The difference between



Table 7.6. Comparison of Reliability Coefficients Obtained from Equivalent
Forms and from Fractions of a Single Test

                                                  Alternate    Single
                                                    Forms       Test
Otis Quick-Scoring Mental Ability Test, Beta
Pintner-Durost Intelligence Test
    Scale 1, Picture Content
    Scale 2, Reading Content                         .92         .97
Essential High School Content Battery
    Mathematics
    Science
    Social Studies
    English                                          .86         .90


the two procedures varies from test to test, being as small as .04 in
one instance and as large as .14 in another. But in every instance,
it is necessary to discount the odd-even correlation.

HOW HIGH MUST THE RELIABILITY OF A MEASUREMENT BE?

Obviously, other things being equal, the more reliable our measuring
procedure is, the better satisfied we are with it. A question that
is often raised is: What is the minimum reliability that is acceptable?
Actually, there is no general answer to this question. If we must
make some decision or take some course of action with respect to an
individual, we will do so in terms of the best information we have,
however unreliable it may be, provided only that the reliability is
better than zero. (Of course, here as always the crucial consideration
is the validity of the measure.) The appraisal of any new procedure
must always be in terms of other procedures with which it is in
competition. Thus, a high-school mathematics test with a reliability
coefficient of .80 would look relatively unattractive if tests with
reliabilities of .85 to .90 were already available. On the other hand, a
procedure for judging "leadership" that had a reliability of no more than
.60 might look very attractive if the alternative were a set of
uncontrolled ratings having a reliability of .45 to .50.

Although we cannot set an absolute minimum for the reliability of
a measurement procedure, we can indicate the level of reliability that
is required to enable us to achieve specified levels of accuracy in
describing an individual or a group. Suppose that we have given a test
to two individuals, and that individual A fell at the 75th percentile
of the group while individual B fell at the 50th percentile. What is
the probability that A would still surpass B if they were tested again?



Table 7.7. Per Cent of Times Direction of Difference Will Be Reversed in
Subsequent Testing for Scores Falling at 75th and 50th Percentile

                    Per Cent of Reversals with Repeated Test
Reliability     Scores of Single    Means of Groups    Means of Groups
Coefficient       Individuals           of 25              of 100
   .00               50.0               50.0               50.0
   .40               40.3               10.9                0.7
   .50               36.8                4.6                0.04
   .60               32.5                1.2
   .70               27.1                0.1
   .80               19.7
   .90                8.7
   .95                2.2
   .98                0.05


In Table 7.7 the probability is shown for different values of the
reliability coefficient. Thus, where the correlation is .00, there is exactly
a fifty-fifty chance that the order of our two individuals will be
reversed. When the correlation is .50, the probability of a reversal is
1 in 3. For a correlation of .90, there is still 1 chance in 12 that
we will get a reversal on repetition of the testing. To have 4 chances
in 5 that our difference will stay in the same direction, we require a
reliability of about .80.

Table 7.7 also shows the situation when we are comparing two
groups of 25. That is, in class A the average fell at the 75th percentile
of some larger reference group, whereas in class B the average fell at
the 50th percentile. We ask what the probability is that we would
get a reversal if the testing were repeated. Here we still have a
fifty-fifty chance when the correlation is .00. However, the security of our
conclusion increases much more rapidly as the reliability of our test
is increased. When the reliability is .50, the probability of reversal
is already down to 1 in 20; with a correlation of .70 it is only 1 in
1000. Thus, a test with relatively low reliability will permit us to
make useful studies of and draw accurate conclusions about groups,
but relatively high reliability is required if we are to have precise
information about individuals.
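
The text does not state how the entries of Table 7.7 were computed. The sketch below rests on one set of assumptions of our own that comes close to the figures quoted above: scores are normally distributed with unit standard deviation, the difference expected on retesting shrinks to r times the observed difference, and the retest difference varies about that expectation with standard deviation sqrt(2(1 - r)/N), where N is the size of each group. The function name is ours.

```python
import math
from statistics import NormalDist


def reversal_probability(reliability, group_size=1,
                         upper_pct=0.75, lower_pct=0.50):
    """Chance that the order of two scores (or group means) reverses.

    Assumed model (ours, not the text's): normal scores with unit SD,
    expected retest gap = reliability * observed gap, and error SD of
    the retest gap = sqrt(2 * (1 - reliability) / group_size).
    """
    if reliability >= 1.0:
        return 0.0
    std_normal = NormalDist()  # mean 0, standard deviation 1
    gap = std_normal.inv_cdf(upper_pct) - std_normal.inv_cdf(lower_pct)
    expected_gap = reliability * gap
    error_sd = math.sqrt(2.0 * (1.0 - reliability) / group_size)
    return 1.0 - std_normal.cdf(expected_gap / error_sd)


for r in (0.00, 0.50, 0.80, 0.90):
    single = 100 * reversal_probability(r)
    groups = 100 * reversal_probability(r, group_size=25)
    print(f"{r:.2f}  {single:5.1f}  {groups:5.1f}")
```

Under these assumptions the chance of reversal for single individuals falls slowly as reliability rises, while for groups of 25 it falls very quickly, which is the contrast the table is meant to bring out.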

RELIABILITY OF DIFFERENCE SCORES

Sometimes we are less interested in single scores than we are in the
relationship between scores taken in pairs. Thus, we may be
concerned with the differences between scholastic aptitude and reading
achievement in a group of pupils, or we may wish to study gains in
reading from an initial test given in October to a later test given the
following May. In these illustrations, the significant fact for each
individual is the difference between two scores. We must inquire how
reliable our estimates of these differences are, knowing the
characteristics of the two component tests.

[Fig. 7.2. Nature of a difference score. Three bars: intelligence test
(error, specific intelligence factors, common factors); reading test
(common factors, specific reading factors, error); difference score left
by subtraction (error, specific, specific, error).]

It is, unfortunately, true that the appraisal of the difference between
two tests usually has substantially lower reliability than the reliability
of the two tests taken separately. This is due to two factors: (1) the
errors of measurement in both separate tests affect the difference score,
and (2) whatever is common to both measures is canceled out in the
difference score. We can illustrate the situation by a diagram. Look
at Fig. 7.2.

Each bar in Fig. 7.2 represents performance* on a test, broken up
into a number of parts to represent the factors producing this
performance. The first bar represents an intelligence test, and the
second a reading test. Notice that we have divided reading performance
into three parts. One part, labeled "common factors," is a complex
of general intellectual abilities that operate both in the reading and the
scholastic aptitude test. A second part, labeled "specific reading
factors," is abilities that appear only in the reading test. The third part,
labeled "error," is chance error of measurement. Three similar parts
are indicated for the intelligence test. Now look at the third bar,
which represents the difference score. In this bar, the common factor
has disappeared. It canceled out in our process of subtraction. Only
the specific factors and the errors of measurement remain. These are
the factors that determine the difference score. And the errors of
measurement bulk relatively much larger in this third bar. In the
limit, where two tests measured exactly the same common factors, only
the errors of measurement would remain in the difference scores, and
the differences would have exactly zero reliability.

* More precisely, variance in performance.

The reliability of the difference between two scores can be expressed
in a fairly simple formula, which reads

$$ r_{\text{diff}} = \frac{\dfrac{r_{11} + r_{22}}{2} - r_{12}}{1 - r_{12}} $$

where $r_{11}$ is the reliability of one measure,
$r_{22}$ is the reliability of the other measure,
$r_{12}$ is the correlation between the two measures.

Thus, if the reliability of test A is .80, the reliability of test B is .90,
and the correlation of A and B is .60, for the reliability of the
difference score we have

$$ r_{\text{diff}} = \frac{\dfrac{.80 + .90}{2} - .60}{1 - .60} = \frac{.25}{.40} = .62 $$


In Table 7.8 the value of $r_{\text{diff}}$ is shown for various combinations
of values of $\frac{r_{11} + r_{22}}{2}$ and $r_{12}$. Thus, if the average of the
reliabilities of our two tests is .80, the reliability of the difference
score is .80 when the two tests have zero intercorrelation, is .60 when
the intercorrelation is .50, and is .00 when the intercorrelation is .80.
It is clear that, as soon as the correlation between the two tests begins
to approach the average of their separate reliability coefficients, the
reliability of the difference score drops very rapidly.

Table 7.8. Reliability of a Difference Score

Correlation between     Average of Reliability of Two Tests, (r11 + r22)/2
Two Tests (r12)          .50    .60    .70    .80    .90    .95
      .00                .50    .60    .70    .80    .90    .95
      .40                .17    .33    .50    .67    .83    .92
      .50                .00    .20    .40    .60    .80    .90
      .60                       .00    .25    .50    .75    .88
      .70                              .00    .33    .67    .83
      .80                                     .00    .50    .75
      .90                                            .00    .50
      .95                                                   .00
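
The formula for the reliability of a difference score lends itself to a brief sketch. The function name is ours. The first call repeats the worked example for tests A and B, and the loop reproduces the Table 7.8 entries for two tests whose reliabilities average .80.

```python
def difference_score_reliability(r11, r22, r12):
    """Reliability of the difference between two scores.

    r11, r22: reliabilities of the two tests.
    r12: correlation between the two tests (must be less than 1).
    """
    return ((r11 + r22) / 2.0 - r12) / (1.0 - r12)


# Worked example: reliabilities .80 and .90, intercorrelation .60.
print(round(difference_score_reliability(0.80, 0.90, 0.60), 3))

# Average reliability .80 with increasing intercorrelation.
for r12 in (0.00, 0.40, 0.50, 0.60, 0.70, 0.80):
    r_diff = difference_score_reliability(0.80, 0.80, r12)
    print(f"{r12:.2f}  {r_diff:.2f}")
```

The printed column shows the reliability of the difference falling steadily, and reaching zero when the intercorrelation equals the average reliability.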

The low reliability that tends to characterize difference scores is
something to which the psychologist and educator must always be
sensitive. It becomes a problem whenever he wishes to use test
patterns for diagnosis. Thus the judgment that Herbert's reading lags
behind his scholastic aptitude is a judgment that must be made a good
deal more tentatively than a judgment about either his IQ or his
reading grade taken separately. The conclusion that Mary has improved
in reading more than Jane must usually be a more tentative judgment
than that Mary is now a better reader than Jane. Any difference needs
to be interpreted in the light of the standard error of measurement of
that difference.*

Many differences will be found to be quite small relative to their
standard error, and are consequently quite undependable. The
interpretation of profiles and of gain scores are places where this caution
especially applies.

* The standard error of measurement of a difference is roughly equal to
$\sqrt{s_{m_1}^2 + s_{m_2}^2}$, where $s_{m_1}$ is the standard error of measurement
of one test and $s_{m_2}$ is the standard error of measurement of the other.

EFFECTS OF UNRELIABILITY ON CORRELATION BETWEEN VARIABLES

There is one further effect of unreliability which merits brief
attention here because it affects our interpretation of the correlations
between different measures. Let us think of a measure of reading
comprehension and one of arithmetic reasoning. In each of these tests,
the individual differences in score are due in part to "true" ability and
in part to chance "errors of measurement." But if the errors of
measurement are really chance matters, the reading test errors and the
arithmetic test errors must be uncorrelated. There is no relationship
between one toss of a coin and a later toss of a coin. So we have
these uncorrelated errors in the total score. This means that they
must water down any correlation that exists between the true scores.
That is, the actual scores are a combination of true score and error,
so the correlation between actual scores is a compromise between the
correlation of the underlying true scores and the .00 correlation that
characterizes the errors.

We would like to extract an estimate of the correlation between the
underlying true scores from our obtained data in order to understand
better how much the functions involved have in common. Fortunately,
we can do this quite simply. Such an estimate is provided by the
formula

$$ r_{T_1 T_2} = \frac{r_{12}}{\sqrt{r_{11}\,r_{22}}} $$

where $r_{T_1 T_2}$ is the correlation of the underlying "true" scores,
$r_{12}$ is the correlation of the obtained scores,
$r_{11}$ and $r_{22}$ are the reliabilities of the two measures in question.

Thus, if the correlation between our reading test and arithmetic test
is .56, and the reliability coefficients of the tests are respectively .71
and .90, we have

$$ r_{T_1 T_2} = \frac{.56}{\sqrt{(.71)(.90)}} = .70 $$

Our estimate is that the correlation between error-free measures of
arithmetic and reading would be .70. In thinking of these two
functions, it would be appropriate to think of the correlation as .70
rather than .56, though the tests correlate only .56.
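
The estimate of the correlation between true scores can be sketched in a few lines. The function name is ours, and the figures are those of the reading and arithmetic illustration above.

```python
import math


def correct_for_attenuation(r12, r11, r22):
    """Estimated correlation between the underlying true scores.

    r12: correlation of the obtained scores.
    r11, r22: reliabilities of the two measures.
    """
    return r12 / math.sqrt(r11 * r22)


# Reading and arithmetic tests: obtained r = .56, reliabilities .71 and .90.
print(round(correct_for_attenuation(0.56, 0.71, 0.90), 2))
```

Since the divisor can never exceed 1.00, the estimate is never lower than the obtained correlation, and the less reliable the two tests, the larger the adjustment.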


FACTORS MAKING FOR PRACTICALITY IN 
ROUTINE USE 

Though validity and reliability may be all-important in measures
that are to be used for special research purposes, when a test is to be
used in classrooms throughout a school or school system a number of
down-to-earth practical considerations must also be taken into account.




It is easy for the administrator to pay too much attention to small
financial savings or to economies of time that make it possible to fit
a test into the standard class period with no shifting of schedules, but,
nevertheless, these factors of economy and convenience are real
considerations. Furthermore, there are other factors relating to the
readiness with which the tests may be given, scored, and interpreted that
bear more importantly on the use that will be made of the tests and
the soundness of the conclusions that will be drawn from them.

ECONOMY 

The practical significance of dollar savings does not need to be
emphasized. Dollars are of very real significance for any educational
or industrial enterprise. Economy in the case of tests depends in part
on cost per copy. It depends in part on the possibility of using the
test booklets over again. From the junior high school on, and
possibly even in the upper elementary grades, it is feasible to administer
a test using a separate answer sheet. Such a separate answer sheet
permits reuse of the test booklets. If a test will be used in successive
years or if testing can be scheduled so that different classes or schools
will be tested on successive days, an important economy can be effected
by using the same test booklet over again several times.

A second aspect of economy is saving of time in test administration. 
However, this is often false economy. We saw in the previous section 
that the reliability of a test depends on the length of the test. As far 
as testing time is concerned, we get about what we give. Some tests 
may be a little more efficiently designed, so that they give a little more 
reliable measure per minute of testing time, but, by and large, any 
reduction in testing time will be accomplished at the price of loss in 
the precision or the breadth of our appraisal. 

A third, and quite significant, aspect of economy is ease of scoring. 
The clerical work of scoring a battery of tests can become either bur- 
densome if it is done by the already busy teacher or expensive if it is 
carried out by clerical help hired for the purpose. A well-designed 
test should be planned so as to simplify and speed up the scoring 
operation. In tests for young children in the first two or three grades 
of school, there is not a great deal that can be done to streamline scor- 
ing procedures. Any attempt to separate the answers from the prob- 
lems, so that the answers will be more convenient to score, is likely 
to confuse the young child and affect his score. By the upper
elementary grades, however, it is practical to provide answer spaces at
the side of the page, preferably the right, so that all answers appear
in a column and can be scored by placing an answer key beside them.



The separate answer sheet referred to in an earlier paragraph, and
also discussed and illustrated in ..., ... a further major
economy in time. It ... time-consuming turning ... number correct,
the test can ... scoring operation. Three main types should be noted.

1. Carbon-Backed Answer Sheets. In these, two sheets are fastened
together ... of one or both sheets are covered with ... When the examinee
marks in the answer spaces, the marks ... inside of the page by the
carbon paper. The inside has the key printed upon it, in the form of
boxes or circles placed opposite ... Scoring consists merely of counting
the number of marks that appear ...

2. Pin-Punch Answer Sheets. These operate in essentially the same
way, except that a pin is pushed through ... specified place. This
technique is ... in the case of a multiscore test. It has been used with
the Kuder Preference Record, where the pin is pushed through several
sheets of paper, each of which is printed with the scoring key ...
Counting the number of holes appearing within the printed ... the
different sheets gives the score for the different areas of interest
without the necessity of using key or stencil.

3. The IBM Answer Sheet and Test-Scoring Machine. For a number
of years the International Business Machines Corporation has
made available a test-scoring machine that operates electrically by
the conductivity of pencil marks on a special answer sheet. The sheet
has 750 answer positions, which may be grouped in different ways but
which most commonly represent 150 five-choice test items. Answer
sheets must be rather carefully marked with a soft pencil,
preferably a special one developed for the purpose, if they are to be scored
accurately. Various other mechanical difficulties have been
encountered, for example, current leakage due to a damp climate. However,
when these conditions are watched for, the machine can considerably
accelerate large-scale test-scoring jobs. The basic IBM machine can
be bought for'sSOOO.OO, or rented for S50.00 per month.* “niis rnsz ^ 
that the equipment must be used quite a good deal of the time
to pay for itself. It is especially useful in organizations having a large
and fairly steady flow of test scoring.

* 1960 prices.


For large-scale testing programs, there are a number of agencies
that maintain scoring services. These are commented on further in
the discussion of school testing programs in Chapter 16.


FEATURES FACILITATING TEST ADMINISTRATION

In evaluating the practical usability of a test, one factor to be taken
into account is the ease of administration. A test that can be handled
adequately by the regular classroom teacher with no more than a
session or so of special briefing is much more readily fitted into a
testing program than a test requiring specially trained
administrators. Several factors contribute to the ease of giving and
taking a test.

1. A test is easy to give if it has clear, full instructions. The
instructions for the administrator should be written out substantially
word for word, so that all the examiner must do is read them and
follow them. Instructions for the examinee should also be complete
and should provide appropriate practice exercises. The amount of
practice that should be provided depends upon how novel the test task
is likely to be for those being tested. Where it is a familiar type of
task or a simple and straightforward instruction, no more than a single
example will be needed. However, for an unusual item format or test
task more practice will be desirable.

2. A test is easy to give if the number of units to be separately
timed is few, and close timing is not critical. Timing a number of
brief subtests to a fraction of a minute is a bothersome undertaking,
and the timing is likely to be inaccurate unless a stop watch is
available for each tester. Some tests have as many as eight or ten
parts, each taking only 2 or 3 minutes. A test made up of three or four
parts, with time limits of 5, 10, or more minutes for each, will be
easier to use.

3. The layout of the test items on the page has a good deal to do
with the ease of taking the test. Items in which response options are
all run together on the same line, items with small or illegible pictures
or diagrams, items that are crowded together, and items that run over
from one page to the next all make difficulty for the examinee. Print
and pictures should be large and clear. Response options should be
well separated from one another. All parts of an item and all items
referring to a single figure, problem, or reading passage should appear
on the same page or double-page spread. Shortcomings on any of
these points represent black marks against a test as far as ease of
taking it is concerned.

FEATURES FACILITATING INTERPRETATION AND USE OF SCORES

It seems axiomatic, though the point is sometimes overlooked, that
a test is given to be used. If the score is to be used, it must be
interpreted and given meaning. The author and publisher of a test have
the responsibility of providing the user with information that permits
him to make a sound appraisal of the test in relation to his needs and
to give appropriate meaning to the score of an individual. They do
this primarily through the test manual and other materials that are
prepared to accompany the test. What may we reasonably expect to find
in the manual for a test, together with the supporting materials? We
have outlined below the aids we believe the test user should expect.

1. A Statement of the Functions the Test Was Designed to Measure
and of the General Procedures by Which It Was Developed. This is
the author's statement of what he considers the test to be measuring
and the evidence that proper steps have been taken to that end.
Particularly for achievement tests, in which we are primarily concerned
with content and process validity, the author should tell us the
procedures by which he arrived at his choice of content and his analysis
of the functions being measured. If he is unwilling to expose his
thinking to our critical scrutiny, we may perhaps be skeptical of the
thoroughness or profundity of that thinking.

Procedures involve not only the rational procedures by which the range
of content or types of objective were selected, but also the empirical
procedures by which items were tried out and screened for final
inclusion in the test.

2. Detailed Instructions for Administering the Test. We have
discussed in an earlier section the need for this aid to uniform
administration by the teachers or others who will have to use the test.

3. Scoring Keys and Specific Instructions for Scoring the Test. The
problems of scoring have also been discussed, under the heading of
economy. The manual and supporting materials should provide detailed
instructions as to how the score is to be computed, how errors
are to be treated, and how part scores are to be combined into a total
score. Scoring keys and stencils should be planned to facilitate as
much as possible the onerous task of scoring.

4. Norms for Appropriate Reference Groups, together with
information as to how they were obtained and instructions for their use.
Chapter 6 was devoted to a full consideration of types of test norms
and their use. It will, therefore, be sufficient at this time to point out
the responsibility of the test producer to develop suitable norms for
the groups with which his test is to be used. General norms are a
necessity, and norms suitable for special types of communities, special
occupational groups, and other more limited subgroups will add to the
usefulness of a test in many cases.

5. Evidence as to the Reliability of the Test. This evidence should
indicate not only the bald reliability statistics but also the operations
used to obtain the reliability estimates and the descriptive and
statistical characteristics of each group on which the estimates are
based. If a test is available in more than one form, it is highly
desirable that the producers report the correlation between the two
forms, in addition to any data that were derived from a single testing.
If the test yields part scores, and particularly if use is to be made of
these part scores, reliability data should be reported for the separate
part scores. It is good practice to report standard errors of
measurement as well as reliability coefficients. An author who
indicates what the standard error of measurement is at each of a number
of score levels is to be commended, since this information shows over
what range of scores the test maintains its accuracy.

6. Evidence on the Intercorrelations of Subscores. If the test
provides several subscores, the manual should report the
intercorrelations of these. These are important in guiding the
interpretation of the subscores and particularly in judging how much
confidence to place in differences between them. If subscores are
correlated to a substantial degree, they are measuring much the same
things, and the differences between them will be largely
uninterpretable.

7. Evidence on the Relationship of the Test to Other Factors. In
so far as the test is to be used as a predictive device, correlations
with criterion measures constitute the essential evidence on how well
it does in fact predict. Full data should be provided on the nature of
the criterion measures, the groups for which data are available, and
the conditions under which the data were obtained. Only then can the
reader fairly judge the validity of the predictor.

It will often be desirable to report correlations with other measures
of the same function as collateral evidence bearing on the validity of
the test. Thus, correlations with individual intelligence test score are
relevant in the case of a group intelligence test.

Finally, indications of the relationship of test scores to such factors
as type of community, socio-economic level, and other characteristics
of the individual or the group are often helpful. They provide a basis
for judging how sensitive the measure is to the background of the group
and to the circumstances of their life and education.

8. Guides for Using the Test and for Interpreting Results Obtained
with It. The developers of a test presumably know how it is reasonable
for the test to be used and the results from it interpreted.
They are specialists in that test. For the test to be most useful to
others, especially the teacher with limited specialized training,
suggestions should be given of ways in which the test results can be
used for diagnosing individual and group weaknesses, forming class
groupings, organizing remedial instruction, counseling with the
individual, or whatever other activities may appropriately be based on
that particular type of instrument.
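
The point made under item 6 can be given numerical form. The sketch below is an illustration of ours and is not drawn from any test manual. It uses the standard formula for the reliability of a difference between two scores expressed on the same standard-score scale (equal variances assumed). The function name and the figures, reliabilities of .90 with intercorrelations of .40 and .80, are hypothetical.

```python
def difference_reliability(rel_a, rel_b, r_ab):
    """Reliability of the difference between two subscores.

    Assumes both subscores are on the same standard-score scale
    (equal variances). rel_a and rel_b are the subscore
    reliabilities; r_ab is the correlation between the subscores.
    """
    if r_ab >= 1.0:
        raise ValueError("intercorrelation must be below 1")
    return ((rel_a + rel_b) / 2 - r_ab) / (1 - r_ab)


# Hypothetical subscores, each with reliability .90.
low_overlap = difference_reliability(0.90, 0.90, 0.40)   # about .83
high_overlap = difference_reliability(0.90, 0.90, 0.80)  # about .50

print(round(low_overlap, 2), round(high_overlap, 2))
```

With equally reliable subscores, raising the intercorrelation from .40 to .80 lowers the reliability of the difference from about .83 to about .50. This is the sense in which differences between highly correlated subscores are hard to interpret.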


SCHEDULES FOR EVALUATING A TEST 

The potential user, who is trying to select the best test for a
particular purpose, might welcome a standard form or procedure for
evaluating the various tests that are candidates for his patronage. A
standard and somewhat objective procedure for rating tests would be very
attractive if an appropriate one could be devised. There have been
several attempts to apply the technique of quantification to tests
themselves, and score cards have been developed to be used in appraising
tests (4, 6, 7). These allocate so many points to aspects of validity, so
many to factors associated with reliability, so many to ease of use and
interpretation, and so forth.

One can question how useful this standard scheme of adding up
points is in this situation. Certainly, if a test has low validity, no
amount of elegance and polish in other respects can make it a
satisfactory instrument. And the importance of different qualities of a
measure varies, depending upon the purpose for which the instrument
is to be used. For that reason, we are not proposing any numerical
scheme for arriving at a score on each test being considered.
However, a systematic outline should help in assuring that the significant
factors are all taken into account and that the analysis is organized in
such a way that comparison of different tests will be facilitated. The
schedule given below provides such an outline. If answers are sought
to all the questions raised in the outline, the potential user should have
a good basis for comparing the suitability for his needs of different
available measurement devices.


An extensive and analytically critical set of criteria for an
acceptable psychological test has been developed by the Committee on Test
Standards of the American Psychological Association, and published
by the Association. This article gives a full statement of the standards
that a commercially distributed test may be expected to meet. Similar
standards for educational tests have been prepared by the American
Educational Research Association and the National Council on
Measurements Used in Education.*


SCHEDULE FOR EVALUATING A TEST 

GENERAL REFERENCE INFORMATION

1. Name of test.

2. Author's name (and position, if available).

3. Publisher. 

4. Date of publication. 

5. Cost. 

6. Time for administration. 

VALIDITY

A. Evidence from the Plan for the Test. What were the procedures for
determining the scope of the test? For determining the particular content
to be covered? For determining the functions and processes to be
represented? How adequate do these appear to be? How closely do the test
objectives correspond to objectives that you are interested in for your
school?

What provisions were made for editorial review of the test materials?
How adequate do these appear?

B. Evidence from the Test Blank Itself. Do the test items appear
appropriate for the objectives that you are trying to evaluate? Do the
test items appear to be well constructed? Are they free from ambiguity?
Do they have attractive wrong-answer choices?

C. Evidence from Statistical Studies of the Test in Use. With what
concurrent measures has the test been correlated? For what sort of
groups? How substantial are the correlations?

With what later criterion measures has the test been correlated? For
what sorts of groups?

How does the evidence on statistical validity compare with that for other
tests?

How accurate a prediction does it give of significant outside criteria?
How do these results compare with those of other tests that try to measure
the same trait?

D. Evidence from Outside Authority. What have reviewers and critics
said about the validity of the test?



RELIABILITY

A. How Adequately Are Data Reported? Do the authors indicate size
and nature of groups for which data are reported? Do they indicate type
of reliability coefficient computed? Do they give mean and standard
deviation for the groups? Do they report reliabilities for single age and
grade groups?

B. What Are the Facts on Reliability? What actual data on reliability
are reported? (Indicate, as far as given, the age or grade, size of group,
mean and standard deviation, procedures by which reliability was
computed, and resulting values obtained.) How do the data compare with
other competing tests?

PRACTICAL CONSIDERATIONS IN ADMINISTRATION AND USE OF TEST

A. Factors in Administration 

1. Adequacy of manual. 

2. Complexity of procedures. 

a. Complexity of process required of students. 

b. Adequacy of instructions and practice exercises. 

c. Complexity of process required of examiner. Timing, giving 
instructions, and interpreting responses of subjects examined. 

3. Time requirements. 

4. Legibility, attractiveness, and convenience of format. 

B. Factors in Scoring

1. Time required (i.e., form of answer, type of key, etc.).

2. Special skills required (subjective scoring and qualitative
interpretation).

C. Factors in Interpretation 

1. Type of norms. Appropriateness to uses, completeness, repre- 
sentativeness of sample. How readily may raw scores be con- 
verted into derived scores? 

2. Aids to interpretation provided by manual. 

D. Factors in Continued Use 

1. Are there comparable forms? How many? How well is com- 
parability established? 

2. Cost. Does this permit routine continued use? Can blanks be
used a number of times? 

SUMMARY STATEMENT

We have discussed the requirements of a good test under the
headings of validity, reliability, and practicality. A test is valid in so far
as it measures the qualities we wish to measure. It is reliable in so far
as it measures with precision. It is practical in so far as it is
economical of time and money and simple to give and interpret.

The crucial requirement for a test is validity. In some tests,
especially achievement tests, we may have to judge how well the test
represents the content and processes we wish to measure. For other tests,
especially aptitude tests, we may evaluate how well the test predicts
some measure that serves as a later criterion of success. In still others,
where we are interested in the test as describing some trait or aspect of
the individual, appraisal of validity is more complex. A "theory" of
the trait or construct must be developed, and the test is evaluated by
how well it fits into the pattern of relationships that would be predicted
from this theory.

There are several different procedures available for obtaining
estimates of the reliability or precision of a measure. The most rigorous
procedure is to administer two equivalent forms of the test on two
separate occasions. The correlation between the two forms provides
a reliability coefficient that tells how closely individuals maintain their
position in the group from one testing to the other. Less exacting
procedures include (1) repetition of the same test and (2) extracting two
scores from a single test, usually by scoring odd and even items
separately. Reliability estimates based on these last procedures are less
satisfactory and should usually be discounted somewhat.

The value obtained for the reliability coefficient will depend on the
range and level of ability in the group tested and the length of the
test, as well as upon the particular procedure used for estimation. It
is particularly necessary to discount a coefficient based on the pooling
of several grades.
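
The odd-even procedure and the dependence of reliability on test length can be sketched in a few lines. What follows is a minimal illustration and not a prescribed routine. The item responses are invented, the function names are our own, and the step-up to full length uses the Spearman-Brown formula, which assumes that the two halves are equivalent.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, factor=2):
    """Estimated reliability of a test `factor` times as long."""
    return factor * r / (1 + (factor - 1) * r)


def odd_even_reliability(item_scores):
    """Split-half reliability from odd- and even-numbered items."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Invented data: 6 examinees, 8 items scored 1 (right) or 0 (wrong).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
split_half = odd_even_reliability(responses)
print(round(split_half, 2))
```

A half-test correlation of .60, for example, steps up to .75 for the full-length test. The estimate reflects only consistency within a single sitting, which is one reason such coefficients should be discounted somewhat.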

To describe the accuracy of an individual's score, the standard error
of measurement is often preferable to the reliability coefficient. It tells
the variation to be expected if we were able to make repeated
measurements of a particular individual. This variation must always be
borne in mind when interpreting the score an individual receives.
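
As a worked illustration, the standard error of measurement may be computed from the standard deviation of the scores and the reliability coefficient by the usual formula, SEM = SD x sqrt(1 - r). The sketch below is ours. The standard deviation of 15, the reliability of .91, and the obtained score of 100 are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


# Hypothetical test: standard deviation 15, reliability .91.
sem = standard_error_of_measurement(15, 0.91)  # about 4.5 points

# Band of one SEM on either side of an obtained score of 100.
obtained = 100
band = (obtained - sem, obtained + sem)        # about 95.5 to 104.5

print(round(sem, 1), tuple(round(b, 1) for b in band))
```

Even with a reliability as high as .91, the band of one standard error on either side of the obtained score is about nine points wide in this example.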

Practicality is a function of economy, ease of administration, and
readiness of interpretation. Economy is affected by initial cost, by the
possibility of reusing materials, and by time required for scoring and
analyzing the results. Ease of administration results from full
directions, simple procedures for the examinee, and an objective record
of performance. Readiness of interpretation is facilitated by good
norms and by a full guide of suggestions for interpretation.

The potential test user should examine the tests from among which
he must choose in the light of the above criteria and pick the one that
best fits his needs.



REFERENCES

1. American Educational Research Association and National Council on
Measurements Used in Education, Committee on Test Standards,
Technical recommendations for achievement tests, Washington, D. C.,
National Education Association, 1955.

2. American Psychological Association, Committee on Test Standards,
Technical recommendations for psychological tests and diagnostic
techniques, Psychol. Bull., 51, No. 2, 1954, Pt. 2.

3. Buhler, R. A., Flicker fusion threshold and anxiety level, unpublished
doctor's dissertation, Columbia University.

4. Cole, R. D., and F. von Borgersrode, A scale for rating standardized
tests, Sch. of Educ. Rec. of Univ. of North Dakota, 14, 1928 (Oct.).

5. McClelland, David, John W. Atkinson, Russell A. Clark, and Edgar L.
Lowell, The achievement motive, New York, Appleton-Century-Crofts,
1953.

6. Otis, A. S., Scale for rating tests, Yonkers, N. Y., World Book.

7. Rinsland, H. D., Form for briefing and evaluating standardized tests.

8. Terman, L. M., and Maud A. Merrill, Measuring intelligence,
Cambridge, Mass., Houghton Mifflin, 1937.

SUGGESTED ADDITIONAL READING 

American Educational Research Association and National Council on
Measurements Used in Education, Technical recommendations for
achievement tests, Washington, D. C., National Education Association,
1955.

American Psychological Association, Committee on Test Standards,
Technical recommendations for psychological tests and diagnostic
techniques, Psychol. Bull., 51, No. 2, 1954, Pt. 2.

Bennett, George K., Harold G. Seashore, and Alexander G. Wesman,
Differential Aptitude Tests manual, New York, Psychological Corp.,
1959, Chapters 4 and 5.

Cronbach, Lee J., Essentials of psychological testing, 2nd ed., New York,
Harper, 1960, Chapters 5 and 6.

Cureton, Edward E., Validity, Chapter 16 in E. F. Lindquist, Editor,
Educational measurement, Washington, D. C., American Council on
Education, 1951.

Doppelt, Jerome E., How accurate is a test score? Test Service Bulletin
No. 50, New York, Psychological Corp., 1956.

Harris, Chester W., Editor, Encyclopedia of educational research, 3rd ed.,
New York, Macmillan, 1960, pp. 1038-1047, 1551-1554.

Thorndike, Robert L., Reliability, Chapter 15 in E. F. Lindquist, Editor,
Educational measurement, Washington, D. C., American Council on
Education, 1951.

Wesman, Alexander G., Expectancy tables — a way of interpreting test
validity, Test Service Bulletin No. 38, Psychological Corp., 1949, 1-5.



QUESTIONS FOR DISCUSSION


Board were developing a gen-


a. Scores on Personality Test X correlated +.43 with teachers' ratings
of adjustment.

b. The objectives to be appraised by Reading Test Y were rated for
importance by 150 classroom teachers.

c. Scores on Clerical Aptitude Test Z correlated +.57 with supervisors'
ratings after 6 months on the job.

d. Intelligence Test W gives scores that correlate +.69 with
Stanford-Binet IQ.

e. Achievement Battery V is based on an analysis of 50 widely used texts
and 100 courses of study from all parts of the U. S.


3. Comment on the statement "The classroom teacher is the only one
who can judge the validity of a standardized achievement test for his class."

4. Look at the manuals of two or three tests of different types. What
evidence on validity is presented? How adequate is it for each test?

5. Using Table 7.3 on p. 171, determine what per cent of those selected
would be above average on the job if a selection procedure with a validity
of .40 were used and only the top quarter were accepted for the job. What
per cent would be above average if the top three-quarters were selected?
What would the two per cents be if the validity were .50? What does a
comparison of the four percentages bring out?

6. Air Force personnel psychologists are doing research on the
selection of jet-engine mechanics. What might they use as criterion measures
of success as a mechanic? What are the advantages and limitations of
each possible measure?

7. What advantages and disadvantages do school grades have as
criterion measures?

8. A test manual contains the following statement: "The validity of
test X is shown by the fact that it correlates .80 with the Stanford-Binet."
What additional information is needed to evaluate this assertion?

9. Look at the evidence presented on reliability in the manuals of two
or three tests. How adequate is it? What are its shortcomings?

10. The manual for test T presents reliability data based on (a) retesting
with the same test form a week later, (b) correlating odd with even items,
and (c) correlating form A with form B, the two forms being given a
week apart. Which procedure may be expected to yield the lowest
coefficient? Why? Which to yield the most useful estimate of reliability?
Why?

11. A student has been given the Stanford-Binet Intelligence Test four
different times during his school career, and his cumulative record card
shows the following IQ's: 98, 107, 101, and 95. What significance should
be attached to the fluctuations in IQ?



12. A school plans to give form A of a reading test in October and
form B in May, in order to study individual differences in improvement
during the year. The reliability of each form of the test is known to be
about .85 for a grade group. The correlation between the two forms
turned out to be .80. How much confidence can be placed in the “gain”
scores? 

13. You are considering three reading tests for use in your school. As 
far as you can judge, the three are equally valid. The reliability of each 
is reported to be .90. What else would you need to know to make a 
choice among the tests? 

14. Examine several tests of intelligence or of achievement that would 
be suitable for a class you are teaching or might teach. Write an
evaluation of one of these tests, following the guide on pp. 201-202.



Chapter 8 

Where to Find Information
about Specific Tests


THE NATURE OF THE PROBLEM 

The production of educational and psychological tests has been
going on for only half a century, but during that time literally thousands
of different tests have been produced. In a comprehensive
bibliography which covered up to about 1945, Hildreth included entries
for 5294 different tests. The number could probably now be increased
by at least another thousand. Of course, many of the earlier published
tests are obsolete now, or only of historical interest, but the number
of currently available tests is still very great.

Not only is the total number of tests great. So also is the variety.
Tests vary widely in testing procedures, in content, and in group for
which designed. There are paper-and-pencil tests, individual
performance tests, rating scales, self-rating procedures, observational
procedures, and projective techniques. There are measures of attitude,
of interest, of temperament, of personal adjustment, of intellect, of
special aptitudes, and of all aspects of school achievement. There are
tests designed for infants, for preschool children, for school children
and adolescents, and for adults.

No one book can hope to introduce a student to even a
representative sampling of tests of all types, covering all sorts of content
for all age levels. The following chapters will introduce some of the most
important and most widely known tests, discussing them as examples
of many others. But this book cannot give a complete treatment of
any particular age group or subject area, and there are so many special
situations in which a reader may be interested or for which he may
need a test that the tests discussed here may include not even one that
fits his particular need.

Since it is impossible to list and evaluate all or even most of the
tests that might be of concern to an audience with varied interests, we
shall approach the problem differently. We shall direct the reader to
sources in which he can find the available tests listed and in some
cases evaluated, and we will try to guide the reader in evaluating the
tests he locates. The present chapter describes source materials for
finding tests and for finding out about them. Chapter 7 has given an
orientation in the factors to be considered in evaluating the
suitability of a particular test for a particular purpose.

The knowledge of where to go to find out about tests of a particular
type and how to evaluate one when found is probably more valuable
than predigested information about a particular test. Tests change,
and the purposes of the test user change. It is impossible to anticipate
what type test will be required for some future need. The important
thing is to know how to go about finding the tests available for that
need when it arises and how to evaluate their relative merits.

There are several different types of questions about a test or an
area of measurement for which one may seek answers. Some of these
types of questions are:

1. What tests have been developed that might serve my present need
or purpose?

2. What are the new tests in my field of interest?

3. What is test A, of which I have heard, like? For what group
and purposes was it designed? Who made it? How long does it take
and how much does it cost? What skills are needed to give and use it?

4. What do specialists in the field of measurement have to say about
test A? How do they evaluate it, in comparison with competing
techniques?

5. What basic factual material do we have on test A? What are its
statistical attributes? What are its relationships to other measures?

6. What research has been done studying or using test A?

Let us see what materials are available to us as we try to answer
questions such as these. These resources include (1) text and
reference books in special areas of testing, (2) the Mental Measurements
Yearbooks, (3) test reviews in professional journals, (4) publishers'
test catalogues, (5) each test itself together with its accompanying
manual, (6) articles in professional journals reviewing a broad field of
testing, (7) comprehensive bibliographies of tests and the testing
literature, and (8) educational and psychological abstract and index
series. These will be considered in turn, the most useful items will be
identified, and the information to be obtained from each type of source
will be indicated.




TEXT AND REFERENCE BOOKS IN SPECIAL 
AREAS 


There are a number of text and reference books covering more
specialized areas of testing. When the scope is limited to include only
elementary-school tests, tests for diagnosis of individual
maladjustment, or tests for vocational placement, it becomes possible to cover
the field in more detail. A book dealing with tests of a particular type
provides a good general introduction to the materials of the field. Such
a book usually acquaints the reader with a representative selection of
established tests in the area, those which the author considers worthy
of mention. In addition, some evaluation of each test is usually given,
indicating the purposes for which it may well be used, and what the
writer considers to be its strengths, weaknesses, and distinctive
characteristics. The book will usually also contain some discussion of the
problems of testing in the field it covers, apart from discussion of
specific tests.

It is not possible to consider all the books that might prove useful
to some reader. However, a number of them have been listed below
with brief annotations. The titles have been chosen in terms of their
recency and the quality of their treatment. In addition, an attempt
has been made to get books that represent a wide range of specialized
interests. The annotations are designed to bring out the distinctive
quality of each book.


Allen, Robert M., Personality assessment procedures, New York, Harper,
1958. Surveys methods and techniques for evaluating personality, and
is a source book of tests and instruments used to assess personality. For
each test, the purpose, reliability, validity, and standardization
procedures are given.

Arny, Clara Brown, Evaluation in home economics, New York,
Appleton-Century-Crofts, 1953. Although the examples given in the first half
are related to home economics, the excellent discussion of purposes and
methods of evaluating student progress is applicable to any class.
Commercially published standardized tests, check lists, and rating scales
are described and uses indicated in an appendix.

Bauman, Mary K., A manual of norms for tests used in counseling blind
persons, AFB Publications, Research Series, No. 6, New York, American
Foundation for the Blind, 1958. Information is given on tests that have
been adapted or developed for use with the adult blind. Also includes
bibliography of source material on tests, testing, and test interpretation
for adult blind.

Blair, Glenn M., Diagnostic and remedial teaching: a guide to practice in
elementary and secondary schools, rev. ed., New York, Macmillan, 1956.
Gives selected lists of tests which Blair judges to be of value for
diagnosis of difficulties in the basic ... ways of using ...

Bond, Guy L., and Eva Bond Wagner, ... information on reading ...

... education, 3rd ed., Englewood ...

Froehlich, Clifford P., and Kenneth B. Hoyt, Guidance testing, 3rd ed.,
Chicago, Science Research Associates, 1959. Chapters in book are
devoted ... single aptitude ...

... Provides lists of achievement and prognostic tests available ...

Super, Donald E., and John O. Crites, Appraising vocational fitness, ...
New York, Harper, in press. Reports on selected tests in a wide variety
of fields that may be used in educational or vocational guidance.

One limitation of books, such as those just annotated, is apparent
from an examination of their dates. At the time that these were
selected (1960), each was judged to be the most recent good book in
its field and yet some were already eight years old. If one adds to
this the time that has elapsed in the writing and printing of the book,
it is easy to see that a book reviewing tests cannot be relied upon for
current materials. The typical textbook gives information about
well-established and accepted tests, but recently published devices or
techniques that are still in the experimental stage are not likely to be
represented. There is a lag of several years between production of a
device and the reporting of it in books reviewing an area of testing.

Another feature of most books surveying a field, which may be in
some cases an advantage and in others a disadvantage, is that they are
selective. They must be. The author cannot discuss everything. He
must pick the items he wishes to present. He selects for discussion the
tests which he considers valuable. In so far as his judgment is sound,
he does a real service to the novice in the field, who is thus led directly
to the more important and valuable material. However, this means
that the reader cannot expect to use a textbook as a source to lead him
to all the tests in an area and permit him to compare them. For a
full listing of the tests of any particular type he will have to look
elsewhere.

THE MENTAL MEASUREMENTS YEARBOOKS

Probably the most useful single reference source for the person
needing to make choices and plan programs in the field of testing is the
series of Mental Measurements Yearbooks. Five Yearbooks have now been
published, and they provide one or more critical reviews of the tests
they cover. In these volumes, each reviewer deals with tests in a field
in which he is presumed to be competent, and tests of general interest
are appraised by two or more reviewers. The reviews are fairly full,
indicating the strengths and weaknesses of a test, comparing it with
others in the field, and indicating the purposes for which the reviewer
considers it suitable.

In addition to reviews, the Yearbooks also include the factual items
about each test that the user is likely to need, such items as author,
publisher, time to administer, grades for which suitable, and forms
available. Finally, for each test the Yearbooks give a bibliography of
books and articles that have appeared dealing with the test. Some of
these bibliographies are quite extensive, amounting in the case of one
test to 2297 entries.

The Yearbooks have other features of value to the test user. One is a
section on books dealing with measurement problems. This undertakes to
list all the significant books on measurement in the period covered and
in addition gives excerpts from the reviews of them in psychological
and educational journals. The bibliography and reviews provide a guide
to, and an evaluation of, publications in the field. Also valuable is a
very complete index and directory section. This includes (1) a directory
of publishers of the tests and of the books on measurement in the
volume, (2) ... an index of tests organized by content. These indexes
make it possible to locate any test or type of test ...



SOURCES OF TEST INFORMATION 

When a question arises about a test, [...] reaches almost
automatically for the Mental Measurements Yearbooks. [...]
that must answer frequent questions [...] of testing.

The Yearbooks are not too [...] if one wishes to cover
early as well as current tests in a [...]. At the present writ-
ing, there are five of them, published in 1938, 1940, 1949, 1953, and
1959. To cover the tests [...] five
volumes and may in fact need to [...].
A new test is ordinarily reviewed in the first Yearbook that came out
after it was published, and reviews may [...] sub-
sequent volumes. Space limitations did not permit [...] the 1938
Yearbook of all the older tests that were [...]. [...]
reviews of some of these first appeared in [...]. Even the set
of volumes taken together does not undertake [...] in its
coverage of tests of a given type. However, [...] together the
material in the complete series, the reader will probably find an ap-
praisal of any test that he is likely to consider using, published up to
the time that planning for the last Yearbook was [...]. [...]
two Yearbooks cover tests up to about [...]; [...] covers the
period from 1940 through 1947; the fourth [...] from
1948 through 1951; and the fifth brings us up to 1958.


JOURNAL TEST REVIEWS

We still face the problem of getting information on the latest tests
and testing developments. One way of keeping up with
new tests is through reviews in professional journals. Tests of interest
to the psychologist and the counselor have been reviewed for a number
of years in the Journal of Consulting Psychology and the Journal of
Counseling Psychology. In late 1959, a test review section,
“Testing the Test,” was initiated in the Personnel and Guidance Jour-
nal. These sources should keep the test user up to date on the most
significant new psychological tests within a year or so of their ap-
pearance.


TEST PUBLISHERS 

The most up-to-date information on what tests are available is prob-
ably to be obtained from the test publishers themselves, either through
correspondence or through their catalogues. There are many pub-
lishers, too many to list here, so that gathering information from all
of them would be quite an undertaking. However, the number who
publish extensively in the testing field is a good deal more limited.
A number of the most important publishers are listed in Appendix IV
together with their addresses and some indication of the types of mate-
rial and the services they supply.

The limitations of a test publisher as an entirely unbiased source of
information on the values and limitations of his own publications are,
of course, obvious. Reversing Marc Antony, we may say he comes
to praise his tests, not to bury them. However, as a source of infor-
mation about, rather than evaluation of, his tests, he can be very help-
ful. In Chapter 7 we have considered how the potential user may go
about appraising a new test for himself in the light of the information
he can get from the test producer and from other sources.

TEST AND MANUAL

The individual who is seriously considering using a particular test
will certainly need to examine the test itself and the manual the pub-
lisher has prepared to go with it. Each publisher’s catalogue will indi-
cate the price for which a specimen set of each test may be obtained.
The specimen set contains a copy of the test itself, the instructions for
administering and scoring, and part or all of the supplementary mate-
rials available to the user to help in interpreting the test.

The amount of supplementary materials included in a specimen set
varies from one publisher to another. The potential user can legiti-
mately expect the publisher to include materials in a specimen set
that will provide all the information he needs in order to arrive at a
decision as to the suitability of the test for his purposes. He should
be skeptical of any test for which the information supplied him is
incomplete. The individual who wishes to examine a number of dif-
ferent tests without buying specimen sets of each may be able to find
a test file in the library or the guidance department of his local uni-
versity.

To obtain specimen sets of tests, the applicant must ordinarily pre-
sent some sort of credentials. A letter on the official letterhead of his
school or institution will often suffice. A note from the university
where he is studying may serve the function. The limitations that
publishers place upon the distribution of their materials depend upon
the nature of the materials. They will often refuse to distribute tests
that require special skills to administer and interpret unless the ap-
plicant [...]

[...] user with a basis for judging [...]
form of test exercises correspond to the objectives and functions he
wishes to measure. The [...], col-
lectively called the test manual, [...] of any test.
It varies enormously in [...] from one test
to another. In some [...] the collateral material
becomes almost a book. It provides important in-
formation to help in using [...]. We have indicated
in Chapter 7 (pp. 202-203) the types of information a test user has
a right to expect to find in the test manual. [...] that provides
all this information becomes a very important source of information
about the test.

Manuals differ greatly not only in [...] but also in
impartiality and integrity. Probably no [...]
a promotional element. However, sometimes the manual becomes to
a very large extent a promotional device focused [...]
sales of the test. The potential user [...] appropriately [...]
aspect of the manual and must endeavor to discount appropriately the
claims made for the test. There often appears to be an inverse rela-
tionship between the grandeur of the claims that a [...]
evidence on which they are based. The reader will [...]
attention focused on the evidence presented in [...],
claims in the light of this evidence, and to be suspicious of
the test whose manual makes sweeping claims but presents very little
data.


JOURNAL REVIEW ARTICLES 
It is sometimes useful to refer to summary articles covering recent
developments in tests and testing. The most regular of [...]
years has been the triennial summary in the Review of Educational
Research. This journal undertakes to summarize research in a num-
ber of different sectors of education. Its publication schedule is ar-
ranged so that a given area is treated every 3 years. Material [...]
and measurements was reviewed in the February, 1959, [...]
was devoted to educational and psychological testing. Similar re-
views appeared in 1956, 1953, and every third year back to [...].
Because of the volume of material to be covered, these reviews are
very condensed, but they do introduce the reader to new tests and
testing research and provide him with a bibliography of original ref-
erences to which he can go for a fuller report on any topic in which
he is interested.

Since 1950, the Annual Review of Psychology has provided a yearly
review and bibliography on selected psychological topics. Chapter
headings such as “Individual Differences” and “Theory and Tech-
niques of Assessment” suggest sources for material of possible interest
[...].

[...] has prepared an annotated bibliography on
reading which has appeared in recent years in the Journal of Educa-
tional Research. This deals with reading tests as well as with other
reading problems.

COMPREHENSIVE BIBLIOGRAPHIES ON 
TESTS AND TESTING 

There have been, from time to time, [...]
on tests and testing. However, the [...] Hil-
dreth [...] is badly out [...]
is of interest primarily to a person [...] in tests
of a given type. Its [...] publication was
quite complete. The bibliography [...] a reference
source for each test, [...] information about it.

One fairly extensive bibliography [...]
of testing (i.e., such issues as reliability, item analysis
techniques, etc.) has [...] and Kavruck. This
source provides over 250 [...] covering the period from 1929
to 1949.

These extensive [...] useful primarily to the person
who wants to dig fairly [...] tests of some particular type, or
into some technical testing problem.

ABSTRACTS AND INDICES

Two final [...] be brought to the attention of the seri-
ous student are Psychological Abstracts and the Education Index.
These are basic bibliographic [...] in the fields of psychology and
education respectively. Each [...] to provide a complete listing
of current publications in [...]. The field for the Psy-
chological Abstracts is [...] defined, being [...]
to scientific and technical publications in psychology. Each [...]
is represented not merely by [...] abstract [...]
the nature of the [...]. [...]
index and author index aid in [...]. [...] Psycho-
logical Abstracts [...] of tests. This ap-
[...] beginning of each issue under the
[...] the literature of [...]
tests and of findings [...].

The Education Index covers a [...],
since it deals with the whole broad [...] and includes
popular and professional materials [...] more tech-
nical and scientific nature. [...] gives reference [...], providing no in-
formation about the nature and [...]. [...]
organized, and the user who looks [...] such topics as ability
tests, educational measurement, mental tests, [...] tests will
find most of the material relating to measurement.

The joint use of the Psychological Abstracts and the Education
Index, supplemented by the other sources [...], should
enable the student who wishes to dig into [...] measurement
problem to locate the bulk of the work that has been [...] on that
problem.


SUMMARY STATEMENT 

At the beginning of this chapter a number [...] were sug-
gested to which a test user might wish answers. The [...]
of information about tests and testing have now been discussed. By
way of summary, we may try to relate the [...] questions.
An attempt has been made to do this in Fig. 8.1. [...] of this
chart are listed various questions one might raise about a test, [...]
test, or testing problem. On the side are listed the [...]
types of source material referred to in this chapter. [...]
symbol to represent the extent to which the [...] would help [...]
answering the question. The symbol * [...]
the sources that would probably be most helpful and [...] which one
would turn first. Sources marked • are ones that could be ex-
pected to contribute to the needed answer. Sources marked [...]
ones that might perhaps provide some useful information. [...]
there is no entry at all, the source is not likely to be helpful in that
connection. A critical study of this table, with analysis of the [...]
for the various entries, should leave the reader well prepared [...]
and get for himself the information he needs in order to [...]
or as background for a specific testing problem.



7. Buros, O. K., The fifth mental measurements yearbook. Highland Park,
N. J., Gryphon Press, 1959.

8. Goheen, Howard W., and Samuel Kavruck, Selected references on test
construction, mental test theory, and statistics. Washington, D. C.,
U. S. Government Printing Office, [...].

9. Hildreth, Gertrude H., A bibliography of mental tests and rating scales.
New York, Psychological Corp., 1939.

10. Hildreth, Gertrude H., A bibliography of mental tests and rating scales,
1945 supplement. New York, Psychological Corp., 1946.

QUESTIONS FOR DISCUSSION 

1. Using the sources indicated in the text, prepare as [...]
you can of currently available standardized tests in [...]
purpose (i.e., tests in first-year Spanish, reading readiness tests, [...]
American history for the twelfth grade, etc.).

2. Using the Mental Measurements Yearbooks, find out what reviewers
think of a particular test that you are interested in.

3. Using the Fifth Mental Measurements Yearbook, find out what re-
viewers have to say about one of the following titles that interests you:

Doll, E. A., The Measurement of Social Competence.
Eysenck, H. J., The Scientific Study of Personality.
Remmers, H. H., Introduction to Opinion and Attitude Measurement.
Sarason, S. B., The Clinical Interaction: With Special Reference to the
Rorschach.
Strong, E. K., Jr., Vocational Interests 18 Years After College.

4. To what sources would you go to try to answer each of the follow-
ing questions? To which would you go first? What would you expect
to get from each?

a. What test should I use to study the progress of two class groups in
beginning French?

b. What kinds of norms are available for the Stanford Achievement
Test?

c. Is the Rorschach Test of any value as a predictor of academic success
in college?

d. Has a new revision of the Wechsler Adult Intelligence Scale been
published yet?

e. What intelligence tests have been developed for use with the blind?

f. What are the significant differences between the Metropolitan
Achievement Tests and the California Achievement Tests?

g. How much does the Otis Quick-Scoring Intelligence Test, Beta, [...]?

h. What do testing people think of the Brainard Occupational Preference
Inventory?

5. Look at two or three publishers’ catalogues. Compare the [...]
ments of tests of the same type. How adequate is the information that
is provided? How objective is the presentation of the tests’ values and
limitations?



Chapter 9

Standardized Tests of
Intelligence or Scholastic
Aptitude


ACHIEVEMENT AND APTITUDE


Ability tests are designed to appraise what an individual can do
under favorable conditions when he is trying to do his best. All any
ability test measures is performance at the time of testing. From this
performance we may hope to make one or more of a variety of dif-
ferent inferences. We may want to infer how effective a program of
school instruction has been in teaching new knowledges or skills, i.e.,
how much progress the pupils have made in some kind of achieve-
ment. We may want to infer how well each individual will do in
learning some new task, i.e., a prognosis of future achievement. We
may want to make inferences about the organization or structure of
human abilities, i.e., what goes with what. We may hope to unravel
the causal factors in individual abilities or disabilities, i.e., why the
individual fails or succeeds with a particular task. All these are dif-
ferent sorts of inferences. The basic evidence in every case is per-
formance on a set of test tasks.

Performances are tied with varying degrees of closeness to specific,
organized instruction. At one end of the scale are those knowledges
and skills that are the direct outcome of organized teaching, usually
in schools but sometimes on the job. To decipher the meaning of:
Arma virumque cano or of

[Gregg shorthand specimen]

are accomplishments that will be developed almost exclusively in a
high-school course in Latin, on the one hand, or Gregg shorthand,
on the other. Even the ablest individual with a wealth of general
life experiences is unlikely to acquire abilities such as these unless they
have been specifically taught. We [...]
have been acquired. Tests thus tied [...]
concerned with evaluation of past progress are spoken of as achieve-
ment tests or proficiency tests. We shall consider them in some [...]

At the other end of the scale are abilities that are [...]
the general experiences of life, quite apart from any [...].
Consider the two pictures in Fig. 9.1A and B. Suppose we were to ask
a child, concerning each of them: What is wrong [...]?
What is silly about it? As we went up the age range, [...]
more and more children who could give us a satisfactory answer. [...]
probably no child would have been specifically taught in school that
shadows extend away from the sun or that in a wind [...]
will be blowing in the same direction. The background to appreciate
the absurdity in these situations and the ability to isolate the [...]
elements in the pictures come with maturity from the general experi-
ences of growing up in our society.

Fig. 9.1A and B. Picture-absurdity test items.

It should be emphasized that any performance depends in some
degree upon experience. A child from a culture that had provided
no experience with books and pictures would be less likely to suc-
ceed with the tasks of Fig. 9.1 because he had never learned to inter-
pret [...] to real things in a real world. A
child who had had no experience with chimneys or with flags would
be severely handicapped on picture B since he would not be able to
interpret the picture or know how these things should behave. These
two absurdities items assume (1) a general familiarity with pictures
and the representation of things by pictures and (2) experience with
trees, shadows, houses, flags, smoke, and wind. Any child in a normal
American environment will have had these experiences in abundance.
For him, therefore, the test provides a measure of perception, analysis,
and understanding of his environment. Differences between individ-
uals in performance on these tasks may then reasonably be expected
to reflect fairly basic differences in certain aspects of intellectual ability.

The examples we have given have illustrated two points quite far
apart on the scale ranging from “directly taught” to “acquired en-
tirely from general life experiences.” Many abilities fall at inter-
mediate points along this scale. The meanings of words, for example,
are taught in school in connection with almost every segment of the
school program. But a very large part of our stock of word mean-
ings is picked up in the reading and listening done out of school as an
incidental by-product of just living in our society. Again, reading is
usually first learned in school, but a large part of the growth in fluency
of reading and depth of understanding of printed matter comes from
out-of-school reading and from the general acquiring of experience
and maturity as a part of growing up. There is no clear boundary
line marking off the ability that is a school achievement from the one
that is not.

Psychologists and educators are interested in measuring the under-
lying aptitudes of human beings. The interest is sometimes in using
these aptitude measures to predict later achievements. It is some-
times in studying the aptitudes for their own sake. But the concept
of aptitude is a tricky one. Aptitude implies some natural or innate
capacity for a particular type of performance: scholastic aptitude,
mechanical aptitude, or artistic aptitude. But all we can observe is
performance on a set of tasks. As stated above, this performance
inevitably depends in some measure upon the experiences that the
individual has had. If we want to get at basic individual differences
in capacity to do a certain type of task, our only hope is to seek for
test items based on experiences so common and general in our culture
that almost every person will have had the requisite experiences. We
must build upon the common core of experience [...]
is what aptitude tests aspire to do. They try to base their [...]
experiences, mostly out of school but overlapping to some extent [...]
provided in school, that are uniformly provided for individuals grow-
ing up in our society. They use these present performances, based
inevitably on a variety of past learnings, as indicators of what the
individual [...] learn to do in the future.

The difference between aptitude measures and achievement meas-
ures is, then, one of degree and emphasis. Any test of [...]
some extent an aptitude test and to some extent an achievement test.
The difference between the two designations is perhaps as much in the
type of inference that we want to make as in the specific content or
“innateness” of the measure. A test can be thought of as an achieve-
ment test when we wish to draw conclusions about past progress and
as an aptitude test when we wish to estimate future potentialities. The
remainder of the present chapter will be devoted to tests of general
intellectual ability, or scholastic aptitude. Chapter 10 will be con-
cerned with other types of special abilities, and Chapter 11 will be
devoted to standardized tests of educational achievement.


TASKS USED TO MEASURE ABSTRACT 
INTELLIGENCE 

Much of the research and development of aptitude measures has
been devoted to devising and studying tests of “general intelligence,”
familiarly known as “IQ tests.” General intelligence, in this context,
has typically meant abstract intelligence: the ability to see relations
in, make generalizations from, and relate and organize ideas repre-
sented in symbolic form. What general intelligence has meant to those
who have tried to test it can be seen from the types of tasks they have
used. Examples of a number of the common types of tasks are given
below. The keyed answers for multiple-choice items are underlined.

VOCABULARY 

A word meaning nearly the same as robust is

A. cheerful. B. strong. C. fat. D. small. E. wealthy.

VERBAL ANALOGIES

Branch is to tree as brook is to

A. water. B. root. C. bank. D. river. E. babble.



SENTENCE COMPLETION

[...]

ARITHMETIC REASONING

A boy bought candy bars at [...] and sold them at
5 cents each. How much did he make on [...]?

A. 30 cents. B. [...] C. [...] D. [...]
E. None of these.

NUMBER SERIES

What number should come next to continue the series [...]?

A. [...] B. 15. C. 16. D. [...] E. [...]

FIGURE ANALOGIES

[Figure items]

CLASSIFICATION

[Figure items]



PICTURE ARRANGEMENT

The pictures below tell a story. Which picture comes first in the story?















COMPREHENSION (COMMON SENSE) 

What is the thing to do if you bump into someone and hurt him?

SIMILARITIES

In what way are wool and cotton alike?

INFORMATION

What month in the year has the fewest days?

DIGIT SPAN

“I will say some numbers. Listen carefully, and when I am through
repeat exactly what I said. Listen:

3 8 7 1 5

Now repeat what I said.”

DIGIT SYMBOL SUBSTITUTION

This is a code test. Each figure stands for a particular number. You
are to put the right numbers in the boxes as fast as you can.










OBJECT ASSEMBLY 

These pieces, if put together correctly, will make a boy. Go ahead and 
put them together. 









GROUP INTELLIGENCE TESTS

Most of the intelligence testing carried on in this country is done
with group tests. These are paper-and-pencil tests much like the ob-
jective type of school examination. They usually consist of 75 to 100
multiple-choice items of the types illustrated in the previous section.
Ordinarily, the examinee must read the problem to himself, must work
along and do the tasks one after another, and must do as many as he
can within a fixed time limit. However, some group tests call for oral
instructions from the examiner, and some are paced by the rate at
which the examiner presents the test tasks.

Some group intelligence tests (e.g., California, Kuhlmann-Anderson,
Lorge-Thorndike, Pintner) are made up of several separately timed
subtests, in each of which all the items follow the same pattern; i.e.,
all are vocabulary items or all are number series items. Others (e.g.,
Henmon-Nelson, Otis) have the different types of items mixed in to-
gether, a vocabulary item being followed by a number-series item,
that by a figure-analogies item, etc. The cycle of
items is repeated, the items gradually becoming more difficult. This
type of test is called a “spiral omnibus” test because of the cyclical
arrangement of the items.

The typical group test is designed to cover a range of three or four
school grades, i.e., 4 to 6, 7 to 9, 10 to 13. Tests for elementary-
school children usually call for responses marked in the test booklet
itself, but many of the tests for older groups use separate answer sheets
that can be machine scored.

There are a number of different series of group tests on the market
that are quite satisfactory to use. The number is too great to permit
discussion of each one here. Several are listed, together with anno-
tations describing and providing some evaluation of each, in Ap-
pendix III.

In the remainder of this chapter we will first describe the two indi-
vidual tests, i.e., tests given to one examinee at a time in a face-to-
face setting, that are currently most widely used in the United States.
These are the Stanford-Binet Intelligence Scale and the Wechsler Adult
Intelligence Scale. Next, we will discuss some of the special types of
intelligence measures: tests avoiding reading and language, tests for the
very young, tests designed to be free from cultural biases. Then we
will compare group and individual tests, considering the advantages
of each. Finally, the remainder of the chapter will be concerned with
evaluation, interpretation, and use of intelligence test results.

THE REVISED STANFORD-BINET TESTS OF
INTELLIGENCE

The individual test that over the years has had the widest use with
school-age children is the Stanford-Binet, brought out by Lewis M.
Terman in 1916. A revised version of the test was published in 1937
by Terman and Merrill, and this has been somewhat further revised
in 1960. The current revision, which uses the best items from the
two forms of the test brought out in 1937, is known as Form L-M.
It provides a set of tests for each of twenty levels of ability, starting
with tests suitable for the average 2-year-old and going up to four
levels suitable for differentiating the abilities of average and superior
adults. To illustrate the content of the test, we have picked four levels
at different points on the scale and listed the tests of each level with
brief descriptions.

TWO-AND-A-HALF-YEAR LEVEL

1. Identifying Objects by Use. (Card with 6 small objects attached.)
“Show me the one that we drink out of,” etc.
Three out of 6 for credit at this level.

2. Identifying Parts of Body. (Large paper doll.)
“Show me the dolly’s hair,” etc.
Six out of 6 parts for credit at this level.*

3. Naming Objects. (Five small objects.)
“What is this?” (Chair, automobile, etc.)
Five out of 5 for credit.

4. Picture Vocabulary. (Eighteen small cards with pictures of com-
mon objects.)
“What’s this? What do you call it?”
Eight out of 18 for credit at this level.*

5. Repeating Two Digits.
“Listen; say 2.” “Now, say 4, 7,” etc.
One out of 3 for credit.

6. Obeying Simple Commands. (Four common objects on table.)
“Give me the dog.” “Put the button in the box.”
Two out of 3 correct for credit.


SIX-YEAR LEVEL

1. Vocabulary. (Graded list of 45 words.)
“When I say a word, you tell me what it means. What is an orange?”
etc.
Six words correct to receive credit at this level. Words like tap,
gown.*

2. Differences.
“What is the difference between a bird and a dog?” “Wood and
glass?”
Two out of 3 correct for credit.

3. Mutilated Pictures. (Five cards of objects with part missing.)
“What is gone in this picture?” or “What part is gone?”
Four out of 5 for credit.

4. Number Concepts. (Twelve 1-inch cubes.)
“Give me 3 blocks. Put them here.”
Four out of 5 different numbers correct.

5. Opposite Analogies.
“A table is made of wood; a window of ______.”
Three out of 4 correct for credit.

6. Maze Tracing. (Mazes, with start and finish points marked.)
“The little boy wants to go to school the shortest way without get-
ting off the sidewalk. Show me the shortest way.”
Two right out of 3 for credit.

* Scored also at one or more other levels.



TWELVE-YEAR LEVEL

1. Vocabulary. (Same as 6-year level.)
Fifteen words correct for credit at this level. Words like juggler and
brunette.

2. Verbal Absurdities. (Five statements.)
“Bill Jones’ feet are so big that he has to pull his trousers on over
his head. What is foolish about that?”
Four out of 5 right for credit at this level.

3. Picture Absurdities.
Picture showing person’s shadow going wrong way. “What is foolish
about that picture?”

4. Repeating 5 Digits Reversed.
“I am going to say some numbers, and I want you to say them back-
wards.”
One out of 3 correct for credit.

5. Abstract Words.
“What do we mean by pity?”
Three out of 4 for credit at this level.

6. Sentence Completion. (Four sentences with missing words.)
“Write the missing word in each blank. Put just one word in each.”
Three out of 4 required for credit at this level.

SUPERIOR ADULT LEVEL II

1. Vocabulary. (Same as 6-year level.)
Twenty-six words for credit at this level. Words like mosaic, flaunt.

2. Finding Reasons. (Two parts.)
“Give three reasons why a man who commits a serious crime should
be punished.”
Both parts right for credit.

3. Proverbs. (Pearls before swine, etc.)
“Here is a proverb and you are supposed to tell what it means.”
One out of 2 correct for credit.

4. Ingenuity.
A 5-pint can and a 3-pint can to get exactly 2 pints of water.
Three out of 3 problems correct for credit.

5. Essential Differences.
“What is the principal difference between work and play?”
Three out of 3 correct for credit.

6. Repeating Thought of Passage.
Short paragraph on the value of life.
Four out of 7 essential ideas must be reproduced for credit.

The above examples illustrate the variety of material included in the
test. Note that the specific tests vary from one level to another. Many
of the tests at the lower age levels are quite concrete, dealing with
little objects and pictures. At the upper levels, the tests tend to be
more abstract and quite heavily verbal. The various tests include
tasks calling for display of past learnings, perception of relations, judg-
ment, interpretation, sustained attention, immediate memory, and
other cognitive processes.

The tasks were selected so as to be of appropriate difficulty for the
average child of the age level to which they were assigned. In testing
a child, the examiner begins at a level where the child is likely to suc-
ceed, but only with some effort. If the child fails these and appears
discouraged, the examiner will drop back to an easier level. Other-
wise, he will move ahead level by level until he reaches a level at which
the child fails all tests. When the upper limit has been established,
the examiner will be sure to go back and establish the level at which
the child can do all the tasks. Often, a few quite easy tests will be
given at the end to build up the child's morale.

The child is credited with the basal age at which he passes all tasks 
plus a credit for tasks passed at more advanced levels. Each task 
passed at a given level credits the child with the same number of 
months of mental age. Thus, where there are 6 tests at each year age 
level, passing a single test gives a credit of 2 months of mental age. 
For example, child A 

Passed all tasks at 6-year level     = 6 yrs. basal age
Passed 3 of 6 tasks at 7-year level  = 6 mos. credit
Passed 1 of 6 tasks at 8-year level  = 2 mos. credit
Failed all tasks at 9-year level     = 0 credit

Resulting in a mental age of 6 yrs., 8 mos.
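
The crediting rule just described is simple arithmetic, and a short Python sketch may make it concrete. This is only an illustration of the example for child A, not a scoring procedure taken from the test manual; the function names are hypothetical, and the sketch assumes the same number of tests at every year level.

```python
def mental_age_months(basal_years, passes_above_basal, tests_per_level=6):
    """Return a mental age in months.

    basal_years: the age level at which the child passed all tasks.
    passes_above_basal: tasks passed at each successive higher level.
    tests_per_level: tasks per year level (6 in the example for child A).
    """
    months_per_task = 12 / tests_per_level  # each task earns an equal share of a year
    return basal_years * 12 + months_per_task * sum(passes_above_basal)


def format_age(months):
    """Express an age in months as years and months."""
    years, remainder = divmod(round(months), 12)
    return f"{years} yrs., {remainder} mos."


# Child A: basal age 6; 3 tasks passed at year 7, 1 at year 8, none at year 9.
child_a = mental_age_months(6, [3, 1, 0])
print(format_age(child_a))  # 6 yrs., 8 mos.
```

With 6 tests per level each task is worth 2 months, so the four tasks passed above the basal level add 8 months to the basal age of 6 years.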

Level of achievement is expressed as a mental age, arrived at as
indicated above. The mental age describes the level at which the child
is performing. But this takes no account of the child's life age.
Performance in relation to a group of children of his own age is expressed
as an IQ. The IQ's for this latest revision of the Stanford-Binet are
deviation IQ's, i.e., they are essentially standard scores for which the
mean is 100 and the standard deviation 16 at each age level. In so far
as the normative groups are adequate and comparable from one age
to another, an IQ has the same meaning at one age as at any other.
Tables for converting MA's to IQ's are provided from age 2-0 (2
years, no months) up to age 16-0. For individuals over 16 years of
age the table is entered with a chronological age of 16-0. The way
IQ's spread out is shown in Table 6.7 (p. 145). Thus, a child with
an IQ of 130 would surpass about 95 per cent (95.5 per cent by Table
6.7) of children of his age; one with an IQ of 90 would surpass about
23 per cent (22.7 per cent by Table 6.7).
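
The idea of a deviation IQ as a standard score can be sketched in Python. In practice the examiner reads the IQ from the published conversion tables; the sketch below only illustrates the underlying scaling, and the age-group mean and standard deviation it uses are made-up values, not figures from the manual.

```python
def deviation_iq(score, age_mean, age_sd, iq_mean=100.0, iq_sd=16.0):
    """Rescale a score to a scale with the stated mean and standard deviation.

    score, age_mean and age_sd must be in the same units (here, months of
    mental age for a hypothetical age group). iq_sd=16 follows the text.
    """
    z = (score - age_mean) / age_sd  # distance from the age-group mean, in SD units
    return iq_mean + iq_sd * z


# Hypothetical age group: mean mental age 80 months, standard deviation 8 months.
print(deviation_iq(92, 80, 8))  # 124.0, i.e., 1.5 standard deviations above the mean
print(deviation_iq(80, 80, 8))  # 100.0, exactly at the mean
```

Because the same mean and standard deviation are imposed at every age level, a given IQ marks the same relative standing whatever the child's age.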



THE WECHSLER INTELLIGENCE SCALES

The second major individual intelligence test is the Wechsler Adult
Intelligence Scale (WAIS). This test was designed for adults, and the
materials and tasks were chosen for their appropriateness for adults.
The pattern of the test differs from that of the Binet. Whereas the
Binet, developed for children, is organized in successive age levels,
the WAIS is organized in subtests representing types of tasks. The
subtests are the following:

Verbal Subscale

1. General Information.
2. General Comprehension.
3. Arithmetical Reasoning.
4. Similarities.
5. Digit Span.
6. Vocabulary.

Performance Subscale

7. Digit-Symbol Substitution.
8. Picture Completion.
9. Block Design.
10. Picture Arrangement.
11. Object Assembly.

Tasks like those in a number of the subtests will be found among the
examples on pp. 222-225.

Each subtest of the WAIS yields a separate score, which is then
converted into a standard score for that subtest. The subtest standard
scores are combined in three different groupings to yield total scores,
and from these total scores three different types of IQ's may be read
from norm tables. The three IQ's are (1) a verbal IQ from subtests
1 through 6, (2) a performance IQ from subtests 7 through 11, and
(3) a total IQ from all the subtests put together. The separate verbal
and performance IQ's may have diagnostic significance in the case of
certain individuals with verbal, academic, or cultural handicaps. The
IQ on the WAIS is also a standard score, set to make the mean of the
normative sample 100 and the standard deviation 15.
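
The three groupings can be illustrated with a brief Python sketch. It stops at the total scores, because the step from a total score to an IQ depends on the published norm tables, which are not reproduced here. The subtest standard scores in the example are invented for illustration, and the function name is hypothetical.

```python
VERBAL_SUBTESTS = range(1, 7)        # subtests 1 through 6
PERFORMANCE_SUBTESTS = range(7, 12)  # subtests 7 through 11


def wais_totals(standard_scores):
    """Sum subtest standard scores into verbal, performance, and total scores.

    standard_scores maps subtest number (1-11) to that subtest's standard score.
    """
    missing = [n for n in range(1, 12) if n not in standard_scores]
    if missing:
        raise ValueError(f"missing subtests: {missing}")
    verbal = sum(standard_scores[n] for n in VERBAL_SUBTESTS)
    performance = sum(standard_scores[n] for n in PERFORMANCE_SUBTESTS)
    return {"verbal": verbal, "performance": performance, "total": verbal + performance}


# Invented standard scores for one examinee.
example = {1: 10, 2: 12, 3: 9, 4: 11, 5: 8, 6: 13,
           7: 10, 8: 9, 9: 12, 10: 11, 11: 10}
print(wais_totals(example))  # {'verbal': 63, 'performance': 52, 'total': 115}
```

Each of the three totals would then be looked up separately in the norm tables to obtain the verbal, performance, and total IQ.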

As we have indicated, the original Wechsler Intelligence Scale was
designed for adults. It was suitable for use with adolescents and
with adults of all ages. Subsequently, however, the material has been
extended downward to make a test for children. The same general
pattern of subtests has been used, though with minor variations. In
particular, the nature of the tasks in several of the subtests changes as
one goes down to the easiest items. The Wechsler Intelligence Scale
for Children (WISC) is designed to be usable from age 5 to 15.

The features that distinguish the WAIS from the 1937 edition of the
Stanford-Binet are:

1. Original test items specifically designed for adults.

2. Organization by subtests rather than by age levels.

3. Provision for separate verbal and non-verbal IQ's.



All these features seem like sound adaptations in a test for adults.
There would probably be agreement now in preferring the
WAIS as a measure for adolescents and adults, though its relation to
academic success is perhaps not as clearly established as is the Binet's.
(As a matter of fact, at these ages a printed group test would usually
seem more appropriate for academic prediction.)

The WISC cannot be used with children below 5 and is probably
not very satisfactory below the age of about 7. For young children
the Binet would be generally preferred. In the age range from 7 to
15, a decision between the two tests is not an easy one. The Binet is
reported to be somewhat more difficult and time-consuming to give.
The usual Binet procedure of carrying the examinee through to the
point where he encounters a long series of failures is judged to be a
seriously upsetting matter for some emotionally tense children. The
separate verbal and performance IQ's of the WISC should be quite
useful in some cases in understanding children whose verbal development
is either very accelerated or retarded. It has diagnostic value
for some children with special educational disabilities. However, the
Binet is probably a somewhat more reliable measure. (No directly
comparable data are available.) The test items entering into the Binet
have had the benefit of trial in earlier forms, with opportunity to revise
and select on the basis of that experience. The ultimate basis for
choice will be the validity of the inferences that can be made from
each in the situations in which they are actually used. Prediction of
academic success can apparently be made about equally well from
either test. It seems likely that the two tests are about equally useful
for children with mental ages of 7 or above.

NON-LANGUAGE AND PERFORMANCE TESTS

Most of the widely used intelligence tests depend to some degree
upon language and include tasks presented in verbal terms. This is
natural, since the bulk of our learning and thinking makes use of
language. For the usual person and in relation to the usual type of
academic learnings, aptitude for learning can be tested more efficiently
by tasks that involve language than by those that do not. However,
for some groups or situations this is not so. The most obvious example
is that of groups who do not speak the language or speak it only
slightly. When an individual has limited command of English, results
from a verbal test in English are in large measure meaningless.
Children who have had little opportunity to attend school may suffer a
special handicap on a test that relies upon materials close to school
learnings. For groups of this sort, tests have been developed that do
not require language. In some of these, only the test tasks are
non-language in character; in others the instructions can be given by
pantomime and no language need be used at any point during the testing.

A group test that requires no language in solving the test problems,
though the instructions are presented in words, is the Lorge-Thorndike
Intelligence Test, Non-Verbal Series. Types of tasks that are included
are figure analogies, figure classification, and number series. (See
examples on p. 223.) A group test that dispenses with language in
both instructions and test is the Pintner Non-Language Test, in which
all instructions may be given by pantomime. The test includes the
following types of tasks, which are illustrated in Fig. 9.2.


[Figure: sample items in panels labeled Test 1 through Test 5.]

Fig. 9.2. Pintner Non-Language Test. (Copyright 1941, World Book Co.,
Yonkers, N. Y. Reproduced by permission.)






3. Pattern synthesis, indicating the figure that will result from
superimposing two figures.

4. Movement sequence, selecting the figure that follows the movement
sequence established by three figures in the stem of the item.

5. Manikin, selecting the manikin that is the same as the one in the stem
except rotated in some way.

6. Paper folding, selecting the diagram that shows how a paper folded
and cut in a specified way will look when unfolded.

When used with ordinary school groups, a test such as the Pintner
Non-Language provides an appraisal of intelligence somewhat distinct
from that provided by a verbal measure. Thus, this test correlates
only about .65 with the Pintner General Ability Test, Language Series,
a test made up of verbal and arithmetical material. With usual groups,
the non-language test may be expected to be somewhat less effective
as a predictor of school achievement. The value of the non-language
test is for atypical individuals or groups, i.e., the deaf, the foreign born,
or the academically retarded.

Individual tests are also available that do not require the use of
language. We have already described the Wechsler Adult Intelligence
Scale and referred to the performance IQ provided by this test. The
performance IQ is based upon five subtests that do not require the
subject to use language once he has been instructed as to the nature
of his task. A performance test that is widely used with children, as a
supplement to the Binet when a verbal handicap is suspected, or for
groups with which the Binet would not be appropriate is the Arthur
Point Scale. We shall describe it in some detail, since it is a good
representative of individual performance tests.

The Arthur Point Scale consists of two forms, which contain somewhat
different tests. Form I has nine subtests, as follows:

1. Knox Cubes: The examiner taps four cubes in a specified sequence,
and the subject must reproduce the sequence.

2. Seguin Form Board: Ten geometric figures are to be placed into the
corresponding holes in the board as rapidly as possible.

3. Two-Figure Form Board: Cut-up pieces are to be fitted into a square
and cross cut out of the board.

4. Casuist Form Board: Similar to the above, only four figures.

5. Manikin or Feature Profile (depending on level): Cut-up figure of
man or cut-up face is to be assembled.

6. Mare and Foal: Picture has cut-outs that are to be fitted into place.

7. Healy Picture Completion I: Picture has square cut-outs and subject
must select the appropriate block to make the most meaningful picture.

8. Porteus Mazes: Simple pencil mazes are to be traced without
retracing.

9. Kohs Block Design: Designs are to be reproduced using colored
cubical blocks, like those in sets for children.

Form II of the test also uses the Knox Cubes, Healy Picture Completion,
and Porteus Mazes, though using a different form or a different set of
tasks from Form I. In place of the other tests, however, it substitutes
the Arthur Stencil Design Test. In this test, the subject is supplied
with a set of colored cards and a set of cut-outs of different designs
and colors. The subject is shown a design that can be produced by
superimposing certain of the materials provided to him. He must select
the right cut-outs and put them together in the right order to produce
the master design.

A point score is allowed the subject for his performance on each
subtest of the Arthur Scale. The score depends in some cases upon
the speed with which the task was completed, in others upon the
correctness of the solution or the number of graded tasks solved. The
point credits for the subtests are summed to give a total point score,
and this is converted to a mental age equivalent. An IQ is computed
by dividing mental age by chronological age. The IQ's appear to have
about the same distribution as for the Revised Binet.
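
The ratio IQ described here can be shown in a few lines of Python. The text says only that mental age is divided by chronological age; multiplying the quotient by 100 is the usual convention and is an assumption of this sketch, as are the function name and the ages in the example.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio IQ: mental age over chronological age, times 100 by convention."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100 * mental_age_months / chronological_age_months


# A child of exactly 6 years (72 months) with a mental age of 6 yrs., 8 mos. (80 months).
print(round(ratio_iq(80, 72)))  # 111
# A child of 10 years (120 months) with a mental age of 8 years (96 months).
print(ratio_iq(96, 120))        # 80.0
```

Both ages must be expressed in the same unit before dividing; months are used here so that part-years cause no difficulty.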

There have been a number of other attempts to evaluate intellectual
ability through performance tasks, ideally ones that would be usable
in different countries and different cultures. One of the most widely
known is the Goodenough Draw-a-Man Test, in which the child is
told simply, “Draw a man, the best man you can draw.” The
performance is scored on completeness and maturity of representation,
not on esthetic qualities.

The individual performance test must generally receive the same
evaluation as group non-language tests. For an English-speaking
person with normal environmental opportunities and without special
language or reading handicap, it represents a less efficient way of
appraising mental development than the more widely used verbal tests.
However, as a way of checking on whether there is a specialized
language handicap it represents a valuable supplemental tool. It makes
it possible to check upon individuals who appear retarded on the verbal
type of test to see whether the retardation is general or whether it is a
localized deficiency in the language area. A performance test such as
the Arthur Point Scale, which can be given with pantomime
instructions, is also useful in testing deaf children, non-English-speaking
children, and other types of special groups.




INFANT AND PRESCHOOL TESTS

The first intelligence tests were made for school-age children.
However, it was not long before the theoretical interests of child
psychologists and the practical needs of child-care and placement agencies
stimulated the attempt to develop procedures for appraising
intelligence in preschool children and even in infants. Any appraisal
procedures with young children obviously had to be individually
administered. Also, they had to be based upon behavior that was
spontaneously exhibited by or could be elicited from children of the age being
studied. Infant tests, therefore, had to take on a very different
character from later appraisals. Arnold Gesell pioneered in designing
tests based on observation of the child's postural, perceptual,
manipulative, and social responses. Does he sit up? Stand up? Walk? Will
he turn to look at a light? Notice a face? Can he pick up a block?
A spoon? A little pellet? By what type of a grasping motion? How
does he react to strange adults? To another infant?

Observations of large numbers of infants showed a typical
developmental sequence in the different aspects of the child's development.
Performance B followed A and was followed by C. Norms have been
established representing the average age at which a particular behavior
manifests itself. The child may be assigned a developmental age,
based upon the behavior he shows. Retests after a short interval show
the child to be fairly consistent in his level of performance. If he is
advanced at one testing, he will tend to be advanced at the other. The
developmental schedules provide a moderately reliable picture of the
individual at that point in time.

What significance does acceleration or retardation in development
during the first year or so of life have for predicting later intelligence?
The answer is well presented in Table 9.1, which shows the correlation
of infant tests given at the ages of 1 to 12 months with intelligence
tests at various later ages. The tests during the first 15 months were
those of the California First-Year Mental Scale, those from 18 months
to 5 years were the California Pre-school Scale, and those from 6 years
on were the Stanford-Binet.

The picture seems quite clear. The infant tests give a fairly good
prediction of developmental status a few months later, but their value
as predictors drops rapidly as the interval increases. The infant tests
provide essentially no prediction of intellectual status at school age.
Whatever factors produce differences in rate of development during
the first year or so of life are entirely distinct from those that
determine intellectual level at school age. It seems, then, that little
practical significance can be attached to results from infant developmental
schedules. They describe an aspect of the child which is temporary
only, not lasting.

Table 9.1. Correlation of Intelligence Tests During First Year of Life
with Later Measures *
(Correlations based on pooling of successive tests)

                                  Age at Initial Test

Age at Later Test    1, 2, 3 mos.  4, 5, 6 mos.  7, 8, 9 mos.  10, 11, 12 mos.

4, 5, 6 mos.             .57
7, 8, 9 mos.             .42           .72
10, 11, 12 mos.          .28           .52           ...
13, 14, 15 mos.          .10           .50           ...            ...
18, 21, 24 mos.         -.04           .23           .39            .60
27, 30, 36 mos.         -.09           .10           ...            .45
42, 48, 54 mos.         -.21          -.16           ...            .27
5, 6, 7 yrs.            -.13          -.07           .02            .20
8, 9, 10 yrs.           -.03          -.06           ...            .19
11, 12, 13 yrs.          .02          -.08           .16            .30
14, 15, 16 yrs.         -.01          -.04           ...            .23
17, 18 yrs.              .05          -.01           ...            .41

* Tests used were: 1-15 months, California First-Year Mental Scale;
18 months-5 years, California Pre-school Scale; 6 years and older,
Stanford-Binet. From Bayley.

There have been a number of different tests prepared primarily for
use with preschool children, i.e., the age range from about 18 months
to 5 years. As a matter of fact, as we have seen, the Stanford-Binet
Intelligence Scale has tests going down to the 2-year level and may be
considered a preschool test. It would compare very favorably with
the other tests available for this age level, though it is somewhat more
verbal than many of the others. A good many of the preschool tests
have tended to get away from the verbal material that appears so
heavily in group tests for older children and also in the Stanford-Binet.

One test for preschool children that has received wide use is the
Merrill-Palmer Scale. This is most suitable for children from 2 to
4, though it can be used with children slightly older and slightly
younger. The test is made up of 38 little subtests, of which only a
small number call for verbal response by the child. A number of the
tasks call for gross motor coordination (standing on one foot) or finer
eye-hand coordination (building block tower, cutting with scissors).
Form and object perception and motor control combine in a number of
form-boards in which cut-outs must be fitted into the appropriate
holes. The test uses materials interesting to the child (blocks,
pictures, scissors, balls, etc.), so that cooperation can usually be
obtained, a real problem with children at these ages.

The Merrill-Palmer Scale has fairly satisfactory reliability, especially
above about 30 months. Correlations with retests 6 months later have
been reported as follows for different age groups:

24 months    .63
30 months    .76
36 months    .78
42 months    .80

The correlation with school-age Binet is about .40 for a Merrill-Palmer
test at age 2; about .45 to .50 for one at age 4.

The Minnesota Preschool Scale is another example of a test designed
for preschool groups. The 26 tests in this scale tend to be more
like those of the Binet. Six tests taken at random from one form of the
scale are described briefly. They are

Test 2: Pointing Out Objects in Pictures. Card with man, chair, apple,
house, and flower on it. Child is asked to point to each in turn.

Test 5: Imitative Drawing. Experimenter makes vertical stroke; then
a cross. Child is asked to imitate each in turn.

Test 8: Imitation. A set of 4 cubes, on which experimenter taps in
specified sequence. Child instructed to imitate the sequence of taps.

Test 14: Colors. Cards colored red, blue, pink, white, and brown.
Child is asked to name the color.

Test 20: Paper Folding. Examiner folds paper with three consecutive
folds. Child is asked to copy exactly.

Test 24: Giving Word Opposites. Child is asked to give words meaning
opposite of cold, bad, thick, dry, dark, and sick.

Test materials are quite simple. Copying, imitating, and responding
to simple verbal relations enter into a number of the tests.

This test appears to be somewhat more reliable than the Merrill-Palmer.
Correlation between two forms of the test given within a few
days of each other was found to be .89. Below 3 years, this test did
not correlate very well with later Binets, but the Minnesota given
between 3 and 4 gave a correlation with Binets at school age of about
.60. However, IQ's on the Minnesota Preschool Scale have quite a
different spread from those for the Binet, so a preschool IQ on this test
is not readily equated to later Binet performance. (See reference 14.)





CULTURE-FREE AND CULTURE-FAIR TESTS 

Many workers in the field of aptitude testing have been distressed
by the fact that test performance depends upon the experiences that a
person has had. Every test maker has recognized this fact and
has tried to base test items upon experiences that would be common
to the group for whom the test was planned. But some have perhaps
taken too narrow a view of the group for whom experiences would
be common. Certainly the test that incorporates pictures of the typical
American house, automobile, or football is not suitable for the
Australian Bushman who has seen none of these objects. The typical
American test assumes the common core of an American culture.
Some critics have gone further and asserted that the typical test is
based upon an urban middle-class American culture. Because of its
highly verbal content and its emphasis upon speed and upon
doing one's best, it is said to be centered in the middle-class culture
and values.

Several attempts have been made to develop tests that are “culture
free,” or if not that at least “culture fair.” These are closely related
to the non-verbal and performance tests described in the previous
section, because a culture-free test is almost necessarily non-verbal. It
must not only be non-verbal but must also be free of the content of
any particular culture.

One attempt to develop such a test is the Cattell Culture Free
Intelligence Test. The Cattell Test is based on the premise that general
intelligence is a matter of seeing relationships in the things with which
we have to deal, that the ability to see relationships can be tested with
simple diagrammatic or pictorial material, and that for a test to be
usable in different cultures the pictures should be of forms or objects
which are fairly universal, i.e., not peculiar to any cultural group.
Items illustrating the different types of tasks are shown in Fig. 9.3.
The evidence that the test is in fact useful for widely different cultures
is largely lacking, but the tasks constitute one further interesting
non-verbal group test that may prove usable, particularly in research studies.

One test that was developed in Great Britain and has been used in
many countries is the Progressive Matrices Test. The type of item
is similar to the last two samples in Figure 9.3. Two types of
progression or relationship are established, one in the horizontal and one in
the vertical direction. The examinee is required to pick the choice
that correctly fills the missing entry in the lower right-hand corner of
the matrix.



[Figure: sample items from Part I, Classifications; Part V, Matrices II;
and Part VI, Matrices III.]

Fig. 9.3. Sample items from Cattell Culture-Free Test. (Institute for
Personality and Ability Testing, 1602 Coronado Drive, Champaign, Ill.
Reproduced by permission.)


An attempt to develop a test that imposes no penalty on different
socio-economic groups in our society is found in the Davis-Eells Games.
(Presumably the child is supposed to be unaware that he is taking
a test.) This test series involves no written language but does require
quite long oral directions. Types of items include:




1. Best ways, in which three pictures are shown in the test booklet
and the examinee is orally instructed to mark the one that is the best
way to carry a pile of packages, get over a fence, etc.

2. Analogies, in which the analogies are presented in pictures and
are of the type, “Glove is to hand as sock is to: arm, leg, foot.”

3. Probabilities, in which a picture is shown and the examinee must
select the one of three orally presented choices that indicates what
probably led up to or is represented in the picture.

4. “Money,” a task based on complex directions for following certain
rules for combining coins to make specified sums.

This test was designed to avoid the cultural and
socio-economic biases within the American culture that were said to
characterize previously existing tests. However, studies of the test in recent
years have failed to confirm that it does so. IQ's from the Davis-Eells
Games are found to have about as high a correlation with
socio-economic status as those for any other test. We must conclude
that either there basically is a relationship between intelligence and
socio-economic status, or that the Davis-Eells Games has failed to
eliminate the bias which its authors believed to characterize other tests.
Since this type of test is laborious to give and relatively unreliable, it
has little to recommend it on other grounds. We must conclude that
it does not appear very useful as a measurement tool at the present time.


GROUP VERSUS INDIVIDUAL TESTS AS MEASURES
OF INTELLIGENCE


We have seen that intelligence tests fall into two main patterns,
group tests and individual tests. The types of tasks presented to the
examinee are a good deal alike in both patterns. However, the testing
procedures have certain significant differences. These may be
summarized as follows:

Group Tests                            Individual Tests

Problems presented in printed          Problems presented orally by
booklet. Read by examinee.             examiner in face-to-face
Personal contact with examiner         situation.
a minimum.

Tasks presented and test timed         Problems presented one at a
as a unit, or separate time limits     time, usually without indication
for each subtest.                      of time limits.

Individual usually responds by         Individual usually responds
selecting one of a limited set of      freely, giving whatever response
response options printed in the        seems appropriate to him.
test booklet.



These differences in procedure have several important implications
for the conduct of testing and for the results that may be expected from
such testing. In the first place, when test tasks are presented orally
to the subject and he does not have to read them himself, his
performance is much less dependent upon his reading skills. The child
who has lagged behind in acquiring these skills is not penalized for
this specific failure. The effect of reading disability upon group test
IQ is shown in one study of children whose reading was a year or more
behind. For these children the group test IQ fell below the individual
Stanford-Binet IQ, whereas better readers came out only slightly
below their Stanford-Binet IQ. In effect, the particular group test
used gave the poor reader an 8 point penalty. This study was carried
out with primary-school children, for whom the actual operation of
reading presents something of a task. One may anticipate that less
effect would be found for high-school or college students.
Furthermore, some group tests are either partly or wholly non-language and
so relatively independent of reading skills. However, the study points
out very clearly the caution with which a group test must be
interpreted for a person who is below average in his reading skills. A low
group test IQ for a poor reader cannot be considered dependable. It
should always be checked with a test that does not involve reading.

The presentation of problems orally by the examiner is also a factor
of some significance to the results the test is likely to yield.
Especially with young children, maintaining continuity of attention and
effort is a significant factor in test performance. What is equally
important, the individual test is an interview situation. The questions
are specifically formulated, and standards are provided for evaluating
the examinee's responses. However, at the same time the personal
relationship of an interview prevails. This gives the examiner a
wealth of opportunities for observing the examinee and judging his
motivation, distractability, signs of anxiety and upset, and other factors
that will help in interpreting the actual test performance. At the same
time, the demands upon the examiner are heavy. If valid testing is to
result, the tasks must be presented properly, interest and cooperative
effort must be maintained, and a uniform standard must be applied in
evaluating responses.

The free-response item in the individual test fits into the interview
setting of the individual test and reinforces both its strengths and its
limitations. Potentially, the free response of the examinee tells us
more about him than the mere record of which option he selected
from a set of five. There is more of the quality of his performance
available to us. We can see just how he goes about defining a word,
whether by class and differentia (i.e., an orange is a round,
orange-colored, citrus fruit) or by use (an orange is to eat). We can observe
the speed and sureness of his attack on a problem task. But we must
also depend on the examiner to interpret and evaluate the responses,
and at this point subjectivity is likely to creep into the examination.
Careful attention must be paid to the standard samples in the
test manual, and experience under supervision is indicated before an
examiner can expect to give and score an individual intelligence test
in a way that will yield results comparable to those of other examiners.

In general, the limitations of group tests are most acute and the
advantages of individual tests most pronounced with young children.
Printed group tests cannot be used successfully with children below
school age. They cannot read and have difficulty in manipulating a
pencil, following instructions, or maintaining sustained attention for
the period that is required for taking a test. These same factors
continue to present fairly serious problems for testing in the primary
grades. However, the factor of cost makes individual testing
impractical for most large-scale users of tests, so that with older individuals
the overwhelming majority of the intelligence tests used are
paper-and-pencil group tests.


RELIABILITY AND STABILITY OF MEASURES 
OF INTELLIGENCE 

We have already presented some evidence on the reliability of
measures of intelligence in our discussion of infant and preschool tests.
The reliability of those early measures is found to be quite modest.
For tests at school age, reliabilities are more promising. Considering
the group tests first, we find that when correlations between two forms
of the same test are reported for an age group or a grade group, they
usually fall between .80 and .90. A few fall outside this range. It is
difficult to estimate how the individual tests compare, since comparison
is made difficult by differences in type of reliability and in the groups
tested. Correlations between Form L and Form M of the 1937
Stanford-Binet ran as high as .95 for different age groups. For ages
from 2 to 6, the median value was .88, whereas for ages above 6 the
median was higher. Since Form L-M was prepared by selecting the
better items from both Form L and Form M, one may expect that the
reliability of the new form is at least as high. The reliability reported
in the manual for the Wechsler Adult Intelligence Scale is .96 for the
verbal IQ, .93 for the performance IQ, and .97 for the full-scale IQ.
These are split-half reliabilities, and consequently should be
discounted somewhat in comparing them with the reliability of the Binet.
Split-half reliabilities for the Wechsler Intelligence Scale for Children
are reported to be .92 at age 7½, .95 at age 10½, and .94 at age 13½.

Though the variations in type of reliability and in type of group
tested make it hard to arrive at an unequivocal answer, it does seem
that individual intelligence tests yield a somewhat more reliable score
than the commonly used group tests. This is probably in part a result
of the somewhat longer actual testing time and in part of the more
uniform motivation and effort when working with an examiner.

The reliabilities of intelligence tests are reasonably satisfactory;
they are among the most dependable of our psychological measuring
instruments. However, it must be remembered that the errors of
measurement are still large enough to require that a score be read as a
band of values rather than as a single exact point.




Table 9.2. Distribution of Stanford-Binet Form M IQ's for Cases with
Identical Form L IQ's

IQ                 f

113+               3
108-112            9
103-107           23
98-102            30
93-97             23
88-92              9
87 and below       3
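
The spread shown in Table 9.2 can be summarized numerically. The Python sketch below is only a rough illustration: it treats each entry as a per cent of cases (the entries sum to 100), places every case at the midpoint of its interval, and assigns the two open-ended intervals the assumed midpoints 115 and 85. Those midpoints and the function names are assumptions of the sketch, not part of the table.

```python
from math import sqrt

# (assumed interval midpoint, entry from Table 9.2)
TABLE_9_2 = [(115, 3), (110, 9), (105, 23), (100, 30), (95, 23), (90, 9), (85, 3)]


def mean_and_sd(rows):
    """Mean and standard deviation of a grouped frequency distribution."""
    n = sum(f for _, f in rows)
    mean = sum(x * f for x, f in rows) / n
    variance = sum(f * (x - mean) ** 2 for x, f in rows) / n
    return mean, sqrt(variance)


def cases_within(rows, low, high):
    """Total frequency for intervals whose midpoint lies in [low, high]."""
    return sum(f for x, f in rows if low <= x <= high)


mean, sd = mean_and_sd(TABLE_9_2)
print(mean)                              # 100.0
print(round(sd, 1))                      # 6.6
print(cases_within(TABLE_9_2, 93, 107))  # 76
```

On these assumptions, individuals who earn the same IQ on one form spread out on the other form with a standard deviation of between 6 and 7 points, and about a quarter of them fall more than 7 points away.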


STABILITY OVER A PERIOD OF YEARS

In addition to knowing the precision with which an intelligence test
appraises an individual's abilities at a particular time, we would like
to know how consistently the individual maintains his position in his
group from one year to the next or over a considerable span of years.
How confidently can we predict what scholastic aptitude an individual
will show when he is of college age from his performance on a test
at age 2? Age 6? Age 10? Evidence on this point is presented in
Figs. 9.4 and 9.5.

Fig. 9.5. Effect of interval on prediction of group test intelligence at
end of high school from earlier group tests. (Study A adapted from
J. E. Anderson. Study B adapted from R. L. Thorndike.)

Figure 9.4 shows the findings from one extensive study using
individual tests. The final test is the Stanford-Binet in every case.
The initial test is the California Pre-school Scale up through 5 years
and the Stanford-Binet after that age. Note that for the early tests
the prediction is rather poor and drops as the interval is increased.
A test at age 2 correlates only .37 with one at age 6 and .21 with one
at age 14 or 15. As we go up the age range, however, the correlations
are higher and the drop is less. A test given at age 8 or 9 correlates
.88 with one at age 10 and still correlates .86 with one at age 14 or 15.
For normal children in a typical environment, a Stanford-Binet at age
8 or 9 appears to provide almost as accurate a forecast of ability near
the end of high school as would the same test given several years later.

Two sets of data on stability of group-test performance over time
are presented in Fig. 9.5. The two follow the same general pattern,
though they differ a good deal in detail. As we go back further in
time the correlation coefficients tend to drop more or less steadily.
[...] tests at around grade 3 or 4 correlate perhaps [...] with the
final test, but for a test in grade 9 [...] .70 to .80. In these
studies of group tests, the tests [...] differed at the different
ages. For this reason, it is [...] the lower correlation over the
longer intervals is due to [...] in the subjects over a span of years
and how much it is due to change in the material included in the
tests. From the practical point of view, Fig. 9.5 suggests that a
group intelligence test needs to be supplemented by testing every 3
or 4 years if pupil records are to provide an accurate indication of
current ability level.


THE PRACTICAL IMPORTANCE OF INDIVIDUAL
DIFFERENCES IN MEASURES OF INTELLIGENCE


To what extent are the individual differences that [...] by tests of
intelligence of importance in the practical [...]? Do they enable us
to predict to a useful degree how an individual will perform in
school, on a job, or in other life adjustments?


INTELLIGENCE AND SCHOOL SUCCESS 

First, let us consider academic success. From the many [...] of
investigations of intelligence test scores in relation to academic
success, a number of conclusions can safely be drawn. These may be
summarized as follows:

1. The Correlation of Intelligence Test Score with School Marks Is
Substantial. Viewing all the hundreds of correlation coefficients
that have been reported, a figure of .50 to .60 might be taken as
fairly representative. Though this constitutes a very definite
relationship, it is only necessary to turn back to Fig. 5.7 and the
discussion on p. 119 to realize that there are still many marked
discrepancies between intelligence test score and what a particular
youngster does in school.

2. Higher Correlations Have Been Found in Elementary Schools
Than in High Schools and in High Schools Than in Colleges.
Studies have indicated a drop in correlation from perhaps .70 in
elementary school to .60 in high school and .50 in college. The drop
in correlation is probably to be explained by the decreased range of
intellectual ability in the college groups. A relatively small
percentage from the lower half of a school population go on to
college, and specific colleges draw from an even more restricted
ability range. Though more and more young people are going to
college, the clientele of specific colleges continues to be fairly
homogeneous in ability.

3. Previous School Achievement Has Given Correlations with Later
School Success as High as or Higher Than Intelligence Tests. In
predicting college marks, for example, high-school record has usually
shown correlations at least as high as those resulting from a
scholastic aptitude test at entrance.

4. Intelligence Test and Achievement Combined Give Still Better
[...] two types of [...].

5. Intelligence Tests Correlate Highly with Standardized Measures
of Achievement. [...] between an intelligence test and total score
[...] achievement battery in the .70's or even .80's [...]. [...]
eleventh-grade group the correlation between [...] California Test of
Mental Maturity and total achievement on the [...] Achievement Test
was found to be .71,⁴ whereas for a group from [...] 6 it was .84.
Another report gives [...] grades 5 and 7 for the correlation between
the Pintner General Ability Test and the Metropolitan [...].

6. The Degree to Which [...] Tests Are Related to Academic Success
Depends Upon [...]. As one would expect, the more academic subjects,
which [...] completely upon the same kinds of verbal and [...] bulk so
large in intelligence tests, show the higher correlations. Thus, one
summary of studies in secondary and higher [...] an average
correlation of .46 with [...] English grades and foreign language
grades but only [...] grades in domestic science.

The fact that intelligence [...] achievement and school progress is
[...] the very way in which the tests were [...] be otherwise. How
these [...] educational planning and in [...] in the chapter.
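
The restriction-of-range explanation offered under point 2 can be
illustrated with a small simulation. The sketch below is ours and is
not drawn from any of the studies cited. It assumes a bivariate normal
population in which ability and marks correlate .70, and it treats the
"college" group as simply the upper half on ability, which is a cruder
selection than any real college applies. The function names, the
sample size, and the seed are arbitrary.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_population(n, rho, seed):
    """Draw n (ability, marks) pairs whose population correlation is rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)
        marks = rho * ability + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((ability, marks))
    return pairs


pairs = simulate_population(n=20000, rho=0.70, seed=1)
full_r = pearson_r([a for a, _ in pairs], [m for _, m in pairs])

# Keep only the upper half on ability (an assumed stand-in for a
# college group).
median_ability = sorted(a for a, _ in pairs)[len(pairs) // 2]
upper = [(a, m) for a, m in pairs if a >= median_ability]
restricted_r = pearson_r([a for a, _ in upper], [m for _, m in upper])

print(f"whole population: r = {full_r:.2f}")
print(f"upper half only:  r = {restricted_r:.2f}")
```

Nothing about the underlying relationship differs between the two
groups in this sketch; the coefficient falls only because the selected
group is more homogeneous in ability.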

INTELLIGENCE IN RELATION TO [...]

We turn our attention [...] accomplishments and consider how
intelligence [...] achievement in the world of work. There are two
types [...] that we may raise. (1) How do workers in different kinds
of jobs compare in measured intelligence? (2) Within a given kind of
job, to what extent is intelligence related to job success?

In relation to the first question, we have a good deal of evidence
stemming from the testing of recruits carried out during World Wars
I and II. Data for a selection of representative jobs are shown in
Table 9.3. This table shows the 10th, 25th, 50th, 75th, and 90th
percentiles on Army General Classification Test standard score (based
on standardization with an average value of 100 and a standard
deviation of 20). A marked gradient is noticed from such occupations
as accountant, teacher, and lawyer to such occupations as barber,
miner, and lumberjack. The gradient follows fairly closely the
educational requirements or average educational background for each
occupation. In general, one may say that occupations select out
individuals jointly on the basis of educational level and of
intelligence. Whether intelligence enters as a significant factor
excepting as it determines educational level is more difficult to
determine. In any event, the net result is appreciable difference
between different occupational groups in performance on intelligence
tests.

Table 9.3. AGCT Standard Scores of Occupational Groups in World War II

                                     Percentile
Occupational Groups        10     25     50     75     90

Accountant                114    121    129    136    143
Teacher                   110    117    124    132    ...
Lawyer                    112    118    124    132    ...
Bookkeeper, general       108    114    122    129    138
Chief clerk               107    114    122    ...    ...
[...]                      99    109    120    127    137
[...]                     100    109    119    126    ...
Clerk, general             97    108    117    125    ...
Radio repairman            97    108    117    125    136
Salesman                   94    107    115    125    ...
Store manager              91    104    115    124    133
Tool maker                 92    101    112    123    ...
Stock clerk                85     99    110    120    ...
Machinist                  86     99    110    120    127
Policeman                  86     96    109    118    128
Electrician                83     96    109    118    124
Meat cutter                80     94    108    117    ...
Sheet metal worker         82     95    107    117    126
Machine operator           77     89    103    114    ...
Automobile mechanic        75     89    102    114    ...
Carpenter, general         73     86    101    113    123
Baker                      69     83     99    113    ...
Truck driver, heavy        71     83     98    111    ...
Cook                       67     79     96    111    ...
Laborer                    65     76     93    108    119
Barber                     66     79     93    109    120
Miner                      67     75     87    103    119
Farm worker                61     70     86    103    ...
Lumberjack                 60     70     85    100    116

Adapted from N. Stewart.²³

While noticing the differences between groups, one must not forget
the substantial range of score within each group. Individuals
differing widely in abstract intelligence function together in the
same occupation. Thus, the upper 10 per cent of meat cutters did as
well on the AGCT as the average lawyer. The bottom 10 per cent of
lawyers showed no more intellectual ability than the upper 10 per
cent of miners. In spite of group differences in average score, there
are still wide individual differences within groups.
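
To give the standard scores of Table 9.3 some concrete meaning, they
can be converted to approximate percentile ranks in the recruit
population as a whole. The sketch below is ours. It uses the scale
stated in the text (mean 100, standard deviation 20) and adds the
assumption, not made in the source, that standard scores are normally
distributed. The function name is hypothetical.

```python
from statistics import NormalDist

# AGCT standard-score scale as described in the text: mean 100, SD 20.
AGCT_SCALE = NormalDist(mu=100, sigma=20)


def percent_below(score):
    """Per cent of the reference population expected to fall below
    `score`, assuming (our assumption) normally distributed scores."""
    return 100 * AGCT_SCALE.cdf(score)


# Occupational medians read from Table 9.3.
for job, median in [("Accountant", 129), ("Policeman", 109), ("Lumberjack", 85)]:
    print(f"{job:<12} median {median}: above about "
          f"{percent_below(median):.0f} per cent of all recruits")
```

The same function applied to the 10th and 90th percentile columns
shows how widely the members of a single occupation are spread over
the general population.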


INTELLIGENCE AND JOB SUCCESS

What can we say about the relationship of intelligence test score to
success within particular jobs? A summary of the findings reported
in a number of different studies is presented in Table 9.4. With the
[...] success is measured by supervisors' ratings [...] by some index
of production on the job [...] likely to be unreliable and biased by
a number of considerations [...] to do with the real efficiency of
the worker. In [...] no test given to the individual can be expected
to [...].

Table 9.4. Relationship of Intelligence Test Score to Measures of
Job Success

                         Median          Per Cent
                         Correlation     Significantly
                         with            Positive         Number of
Type of Job              Job Success     Correlations*    Coefficients

Clerical workers             .35              70               84
Supervisors                  .40              78                9
Salesmen                     .33             100                4
Sales clerks                -.09             ...               15
Protective services          .25              73                6
Skilled workers              .55             100                6
Semiskilled workers          .20              47               45
Unskilled workers            .08              31               13

Adapted from E. E. Ghiselli and C. W. Brown.¹²
* Significant at 5 per cent level.

All in all, we may conclude that (1) intelligence is related to
occupational group membership and (2) though [...] intelligence test
score to job success is [...] be quite low. Prediction of
out-of-school achievement appears a good deal less accurate than
prediction of school achievement.


INTERPRETATION OF GROUP DIFFERENCES IN
MEASURED INTELLIGENCE

As soon as the first intelligence tests were developed, [...]
started administering them to different kinds of groups and [...]
group differences in performance on the tests. They [...] the
sexes, different age groups, groups of different racial or national
[...], urban and rural groups, groups from different parts of the
country, groups from different socio-economic levels, and so forth.
The findings from these studies were fairly consistent in showing
[...] group differences. Lower score on intelligence tests was
associated with lower socio-economic status, living in a rural area,
living in the Southern or Southwestern United States, being an Indian
or Negro, being in an immigrant family from the south of Europe, or
being over 40 years old. However, the interpretation of these
findings has been a source of a good deal of confusion and conflict.

The first naive tendency was to interpret group differences in
intelligence test performance as an indication of innate hereditary
differences between the groups in question. For example, the lower
performance of the children of laboring class parents was interpreted
as indicating basic genetic differences between that group and the
white-collar group. Now, such basic genetic differences have not been
disproved, but many lines of evidence have made psychologists more
cautious in interpreting group differences in intelligence test
performance. Many studies have pointed out the role of life
experiences in influencing test scores and have made us realize how
dangerous it is to make any comparison of groups whose experiences
differ radically. We shall consider some of the relevant evidence.

The testing in the United States in World War I and in World War II
has made possible a comparison of the level of performance of the
military recruit population in 1918 with that in 1940 to 1945. Using a
somewhat revised edition of the 1918 Army Alpha Test with a sample
of World War II recruits, it was possible to estimate the Army General
Classification Test equivalents of different scores on Army Alpha and
thus to compare the performance of the two recruit populations. It
was found that the average World War II recruit surpassed 83 per
cent of the World War I group.
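
The 83 per cent figure can be restated as a distance between the two
group averages. The sketch below is ours, not part of the study
reported. It assumes that scores in the World War I group were
normally distributed, so that a percentile rank translates directly
into standard deviation units; the function name is hypothetical.

```python
from statistics import NormalDist


def shift_in_sd_units(proportion_surpassed):
    """Standard-score position of a person who surpasses the given
    proportion of a normally distributed reference group."""
    return NormalDist().inv_cdf(proportion_surpassed)


# The average World War II recruit surpassed 83 per cent of the
# World War I group.
shift = shift_in_sd_units(0.83)
print(f"difference between group averages: about {shift:.2f} SD")
```

Under that assumption the average recruit of the later war stood
not far short of one standard deviation above the average recruit of
the earlier one.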

A similar comparative study, on a smaller scale, was made of
children in certain mountain counties of eastern Tennessee.³⁰ When 1940
performance was compared with that in 1930, it was found that the
average IQ for children in these counties had risen from 82.4 to 92.2,
a gain of 9.8 points. This gain paralleled a very considerable increase
in accessibility and cultural opportunities in the counties in question.

Comparisons of national groups in their own countries have failed
to substantiate differences found between immigrant groups in the
U. S.¹⁰ Studies of Negro children in New York City have shown a
tendency for the IQ's to be higher for those children who had spent a
longer time in New York.¹⁸ Studies of foster children have found a
level of intelligence for these youngsters above what would have
been predicted from the intelligence or social level of their biological
parents.²²

All these findings point to the fact that intelligence test score
depends upon experience. Where groups differ widely in experience,
differences in test score may be expected to result. Thus, in the
United States between 1918 and 1940 the median schooling of
18-year-olds increased from about 8¼ years to about 10½ years. In
addition, radio sets appeared in over 50 per cent of the homes of the
country. Good roads pushed out into the rural areas, so that it was
relatively easy to get to town. These are only some of the social and
cultural changes. These changes had their impact upon test
performance. A more educated population, exposed to more experiences
and perhaps especially to more extensive and varied use of language,
did better on the tests.

The present discussion does not negate the significance of
intelligence test differences in individuals. These differences are
large even for individuals who have had closely similar environmental
opportunities. Environment and experience are not the whole story
[...]. [...] performance of an individual, some allowance must be made
for the environmental opportunity he has had. An IQ of 90 has a
different meaning for a Negro child who spent his early years in a
sharecropper's cabin in the rural South from what it has for the son
of the local banker.


USING INTELLIGENCE TEST RESULTS IN SCHOOLS 

There are, in general, three types of settings in which tests are
used in schools, and intelligence tests should be considered in
relation to each of these. Standardized tests may enter into
administrative policy as a basis for administrative decisions on such
matters as class grouping, promotion, eligibility for [...]
curricula, and the like. Standardized tests may be used by the
classroom teacher as aids to understanding the individual pupils with
whom he must deal and in making adaptations and adjustments to their
individual needs. Tests may be used by the guidance staff of [...]
in planning the most effective use of special resources for diagnosis
and remedial teaching, in helping the pupil and his family [...]
sound and realistic educational and vocational plans, and in helping
to understand personal adjustment crises when they arise. We may
consider intelligence tests in each of these contexts.


INTELLIGENCE TESTS AND THE SCHOOL ADMINISTRATION

Intelligence tests are likely to enter into the actions of the school
administration either (1) through a policy of using test results as a
basis for forming the group for a classroom or (2) through regulations
specifying score levels that permit or require some special action,
e.g., assignment to a slow-learning class, eligibility to take
algebra, eligibility for a special school, etc. What is an
appropriate attitude toward administrative actions of these sorts?

Grouping by Intellectual Ability. The policy of forming class
groups at least in part on the basis of the intellectual level of the
pupils remains a common one. In 1947 to 1948 more than half of city
school systems reporting used ability grouping in some form in one or
more schools. However, the procedure remains a controversial one. In
part this is due to the varied and somewhat contradictory results
obtained in studies of the effects of ability grouping.⁵ In part it
is due to the variety of specific practices subsumed under the same
label, "ability grouping" or "homogeneous grouping." In part it is
based upon the different initial biases of those discussing the
problem.

It is probably impossible to make any single general evaluation of
ability grouping that would apply to all instances of the practice.
It can be pointed out that grouping together pupils of like mental
ages is only a first step to permit adapting class program and
procedures to the abilities of the pupils in the class. What is most
important is the adaptations that are actually made in materials and
procedures after the grouping has been carried out, and also what
attitudes exist or can be developed in the community toward the
grouping and the adjustments that accompany it. It should also be
noted that groups formed on the basis of intelligence-test scores
will still be quite heterogeneous with respect to academic skills.
The correlations of intelligence and achievement, and of different
aspects of achievement, are low enough so that forming groups on any
one measure will still leave quite a range of performance on any of
the others. In a departmentalized program, as in high school,
effective grouping in separate subject areas can be based on a
combination of an intelligence test and a measure of achievement in
the subject area. Though a general evaluation of achievement can be
combined with intelligence test score for elementary school pupils,
it is not possible to get a group homogeneous for all subject areas.
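
How much spread remains in one measure after grouping on another can
be put in rough numerical terms. The sketch below is ours. It assumes
a bivariate normal relationship and the limiting case in which every
pupil in a group has exactly the same intelligence test score; under
those assumptions the standard deviation of achievement within the
group is the original standard deviation multiplied by the square
root of (1 - r squared). The function name and the sample values of r
are illustrative only.

```python
from math import sqrt


def residual_sd(r, sd=1.0):
    """Spread of a second measure among pupils who are identical on
    the first, given correlation r (bivariate normal assumption)."""
    return sd * sqrt(1.0 - r ** 2)


for r in (0.50, 0.70, 0.80):
    print(f"r = {r:.2f}: within-group SD is "
          f"{residual_sd(r):.2f} of the ungrouped SD")
```

Since real ability groups span a band of scores rather than a single
score, the spread left within an actual class would be somewhat larger
than these figures.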

Many of both the gains and hazards of ability grouping have been 
claimed to lie in relatively intangible areas of interest, attitude, and 
adjustment. Evaluations in these areas have generally been quite in- 
adequate. Thus, it is still largely a matter of opinion whether the 
bright child develops better work habits and leadership traits or feel- 
ings of snobbishness and superiority from being in a special class group. 

Ability grouping for the bulk of pupils is one issue, and special
classes for the relatively extreme deviate is a somewhat different one.

How about the highest and lowest 2 or 3 or 5 per cent in
intelligence? Here we must recognize that special administrative
provisions are possible only in a community of some size. Unless
there are perhaps 500 children per grade in the school system, there
will not be enough extreme deviates to fill a class group. The
problem of the extreme deviate becomes most acute in the case of the
low deviate, because of the obvious problems that the slow learners
have in adapting to the activities and tempo of a regular classroom.
Special class groups have not been a universal panacea, but they do
permit adaptation of the type of class activities and the rate of
progress to the interests and abilities of the slower learners.
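
The arithmetic behind the figure of 500 children per grade is simple
enough to set out directly. The sketch below is ours; the function
name is hypothetical, and the smaller enrollment of 100 is added only
for contrast.

```python
def expected_deviates(children_per_grade, per_cent):
    """Number of children in one grade falling in an extreme group
    that makes up `per_cent` of the grade (whole children only)."""
    return children_per_grade * per_cent // 100


for enrollment in (100, 500):
    for per_cent in (2, 3, 5):
        n = expected_deviates(enrollment, per_cent)
        print(f"{enrollment} per grade, extreme {per_cent} per cent: "
              f"{n} children")
```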

The very bright child is usually a less conspicuous problem in the
regular class. He gets the regular work done. His [...] apparent.
Furthermore, the alert teacher can often [...] than they would in
regular classes, or engage in a wide range of activities without
falling behind children in [...]. Furthermore, there is no real
evidence that membership in special class groups results in
undesirable personality attributes in [...]. In view of the
importance of individuals of high ability [...] and in view of the
long period of training that most of them [...] undergo to take a role
in the professional groups of our society, provisions to accelerate
or enrich their early training [...] a sound social provision where
such provisions are administratively [...].
Intelligence Test Score as an Administrative Standard. Intelligence
test results enter into administrative actions [...] of intelligence
is specified as a prerequisite for some [...] to a pupil. Generally
speaking, the relationship of intelligence test score to educational
progress or success is low enough and the array of factors involved
is great enough so that rigid administrative standards on
intelligence seem rather questionable. Intelligence is one factor
that should receive consideration, together with other [...], in
arriving at a decision with respect to any individual. [...]
flexibility of action is needed, in the light of all relevant
factors. The administration should formulate general policy with
respect to the use of intelligence tests for admitting pupils to
special groups, but the policy should be one which permits actions on
individual cases to be taken in the light of a variety of relevant
factors.

INTELLIGENCE TESTS AND THE CLASSROOM TEACHER

The classroom teacher will want to use intelligence test [...] an aid
to understanding each pupil in the class and to [...] school
experiences that will be most helpful to that pupil. [...] level as
measured by an intelligence test provides probably the best single
clue available to the teacher as to the child's potentialities for
learning the abstract symbolic aspects of the school curriculum. The
test results provide a guide as to what can reasonably be expected of
each pupil: whether the pupil should be expected to move along as
rapidly as the rest of the class, whether the pupil's achievement is
falling enough behind expectation to suggest the need for special
diagnostic or remedial procedures, or whether the pupil's abilities
are enough ahead of those of the bulk of the class so that the
teacher should try to provide special activities and opportunities
for enriching the regular program.

There are certain cautions that need to be observed when the
classroom teacher makes use of intelligence test scores for his
pupils. An enumeration of the pitfalls may help the reader to avoid
them.

1. The general intelligence test, especially the group test, is a
measure of ability to work with symbols, abstract ideas, and their
relationships. This is one quite limited type of ability. The test
does not encompass ability to work with things or people, or perhaps
the ability to solve many types of concrete and practical problems.
The child who is low on an intelligence test will probably have
trouble with the academic aspects of the conventional school
curriculum. However, he may have a good level of skill or ability in
the many non-abstract aspects of living: mechanical, social,
artistic, musical. The teacher should seek these strengths,
capitalize upon them, and build upon them. Above all, the teacher
must recognize that intelligence test score is not a measure of
personal worth and must avoid rejecting the child whose aptitude for
academic pursuits is low.

2. The verbal group intelligence test that is ordinarily used for
school-wide testing is sufficiently dependent upon reading and
arithmetical skills that a low test score must be interpreted
cautiously for a poor reader or low achiever in arithmetical skills.
If possible, individuals of this sort should be tested also with an
individual test or a non-verbal group test to determine whether the
low performance is due to limited ability, or whether it is a
reflection of limited reading and number skills.

3. Intelligence test results for a child whose social and cultural back- 
ground differs radically from that of the rest of the group should be 
interpreted with caution. The possibility of some degree of environ- 
mental deprivation should be borne in mind. 

4. If it is known or suspected that a child was emotionally disturbed
at the time of testing, results should be considered quite tentative.
Motivation and effort are needed for sound test results.

5. The standard error of measurement should always be very real
to the test interpreter. An IQ of 90 should always signify to the
teacher "IQ somewhere between 80 and 100."
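
The band of "80 to 100" around an IQ of 90 in the fifth caution can
be reproduced from the usual formula for the standard error of
measurement, the standard deviation multiplied by the square root of
(1 - reliability). The sketch below is ours. The standard deviation
of 16 and the reliability of .90 are assumed values chosen for
illustration, not figures taken from any particular test manual, and
the band is drawn at two standard errors on either side.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1.0 - reliability)


def score_band(observed, sd, reliability, n_sem=2.0):
    """Band of n_sem standard errors on either side of a score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sem * sem, observed + n_sem * sem


# Assumed values: SD of 16 IQ points, reliability of .90.
low, high = score_band(observed=90, sd=16, reliability=0.90)
print(f"IQ 90: somewhere between about {low:.0f} and {high:.0f}")
```

A less reliable test, or a wider band, would stretch the limits
further still.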

INTELLIGENCE TESTS AND THE GUIDANCE STAFF

Intelligence tests have their most obvious function in the educational
program as sources of information important to persons responsible
for counseling and helping [...] the pupil's intellectual abilities
as one aspect of the total [...] especially important. This
information should [...] consideration in deciding what is an
appropriate educational objective for the pupil; i.e., whether to plan
for college and if so the kind of college to plan for, or what type
of high school curriculum [...]. In vocational counseling, more
specialized [...] kinds we shall consider in the next chapter, are
[...] supplement to the general intelligence test, but these [...]
not so important for educational planning. [...] who is having
problems in school, whether with his school [...] personal
adjustments, an estimate of his intellectual level [...]. As we have
indicated elsewhere, individual tests and non-language tests are
highly desirable supplements to the usual group test [...] any
reading or language handicap is suspected.

The specific situations and circumstances under which intelligence
tests may be used in guidance are so many and varied that they cannot
each be discussed here. Some further consideration is given to [...]
in the guidance program in Chapter 18.


SUMMARY STATEMENT 

Tests of ability include tests of achievement and of aptitude.
Though aptitude tests usually depend less directly upon specific
training than do achievement tests, it must be recognized that any
test performance is in some degree a function of the individual's
[...] of experience. Aptitude tests are distinguished at least in part
by their function: to predict future accomplishments.

Among the most thoroughly explored and widely used aptitude tests
are tests of intelligence. As these have been developed, they tend to
emphasize abstract intelligence, the ability to deal with ideas and
symbols, and may even be thought of as scholastic aptitude tests.

The two main patterns of tests have been group tests and individual
tests. Group tests, resembling the short-answer achievement test in
format, are much more economical to use and are satisfactory for most
purposes when the examinees are normal groups of school age or older.
However, the individual tests have a number of advantages and are
useful particularly with (1) young children, (2) emotionally disturbed
cases, and (3) cases with special educational disabilities.

Special tests have been developed for infant and preschool groups,
for groups with educational and language handicaps, and for groups
from varied cultures and social classes. These may be of practical
value in special cases, though they serve more often as research tools.

Intelligence test results for school-age children are about as reliable
as any of our psychological measurement tools. The widely used
individual tests such as the Stanford-Binet Intelligence Scale and the
Wechsler Intelligence Scales are probably somewhat more reliable than
the typical group test, though the differences are not large. In spite
of the high reliability, appreciable differences may be expected
between one testing and another.

When intelligence test scores are studied in relation to achievement
in the world, the most clear-cut relationships are with academic
achievement. However, it is also true that there are substantial
differences in test performance between persons in different types of
jobs. Furthermore, success in at least some types of jobs has been
found to be related to the abstract intelligence measured by our tests.

Group differences in intelligence (i.e., sex, race, age differences)
must be interpreted quite tentatively, in view of the differences in
background for these different groups. However, individual differences
in intelligence are important facts, which we need to use wisely in
helping individuals in their adjustment to the world of the school and
of work.


REFERENCES 

1. Anderson, J. E., The limitations of infant and preschool tests in the
measurement of intelligence, J. Psychol., 8, 1939, 351-379.

2. Arthur, Grace, A point scale of performance tests, 2nd ed., New York,
Commonwealth Fund, 1943.

3. Bayley, Nancy, Consistency and variability in the growth of
intelligence from birth to eighteen years, J. genet. Psychol., 75, 1949,
165-[...].

4. Clark, W. W., Questions and answers regarding the California Test of
Mental Maturity, Los Angeles, California Test Bureau, [...].

5. Cornell, Ethel L., Effects of ability grouping determinable from
published studies, in The grouping of pupils, Nat. Soc. Study Educ.,
[...].

6. Derner, G. F., M. Aborn, and A. H. Canter, The reliability of the
Wechsler-Bellevue subtests and scales, J. consult. Psychol., 14, [...],
172-179.


7. Durost, W. N., and G. A. Prescott, An improved method of comparing
a capacity measure and an achievement measure at the elementary
school level, Educ. Psychol. Meas., 12, 1952, 741-751.

8. Durrell, D. D., The influence of reading ability on intelligence
measures, J. educ. Psychol., 24, 1933, 412-416.

9. Ebert, E., and Katherine Simmons, The Brush Foundation study of
child growth and development, I, Psychometric tests, Monogr. Soc.
Res. Child Develpm., 8, No. 2, 1943.

10. Franzblau, R. N., Race differences in mental and physical traits,
Arch. Psychol., 1935, No. 177.

11. Gesell, A., et al., The first five years of life: A guide to the
study of the pre-school child, New York, Harper, 1940.

12. Ghiselli, E. E., and C. W. Brown, The effectiveness of intelligence
tests in the selection of workers, J. appl. Psychol., 32, 1948, 575-580.

13. Goodenough, Florence L., Measurement of intelligence by drawings,
Yonkers, N. Y., World Book, 1926.

14. Goodenough, Florence L., and Katherine M. Maurer, The mental
growth of children from two to fourteen years: a study of the
predictive value of the Minnesota Preschool Scales, Univ. Minn. Inst.
Child Welf. Monogr., No. 19, 1942.

15. Goodenough, Florence L., Katherine M. Maurer, and M. J. Van
Wagenen, Minnesota Preschool Scales: Manual of instructions,
Minneapolis, Minn., Educational Test Bureau, 1940.

16. Honzik, Marjorie P., Jean W. McFarlane, and Lucille Allen, The
stability of mental test performance between two and eighteen years,
J. exp. Educ., 17, 1948, 309-324.

17. Justman, J., A comparison of the functioning of intellectually gifted
children enrolled in special progress classes in the junior high school,
unpublished doctor's dissertation, Columbia University, 1953.

18. Klineberg, O., Negro intelligence and selective migration, New York,
Columbia University Press, 1935.

19. National Education Association, Research Division, Trends in city
school organization, 1938 to 1948, Res. Bull., 27, 1949, 4-39.

20. Raven, J. C., Progressive matrices, London, H. K. Lewis, 1956 (U. S.
Distributor, Psychological Corp.).

21. St. John, C. W., Educational achievement in relation to intelligence
as shown by teachers' marks, promotions and scores in standard tests
in certain elementary grades, Harvard Univ. Stud. Educ., 15, 1930.

22. Skodak, Marie, Children in foster homes: A study of mental
development, Univ. Ia. Stud. Child Welf., 16, No. 1, 1939.

23. Stewart, Naomi, A.G.C.T. scores of army personnel grouped by
occupations, Occupations, 26, 1947, 5-41.

24. Stutsman, Rachel, Mental measurement of pre-school children, with a
guide for the administration of the Merrill-Palmer Scale of Mental
Tests, Yonkers, N. Y., World Book, 1931.

25. Terman, Lewis M., and Maud A. Merrill, Stanford-Binet Intelligence
Scale, Manual for the Third Revision, Form L-M, Boston, Houghton
Mifflin, 1960.




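26. Thorndike, R. L., The prediction of intelligence at college entrance
from earlier test, J. educ. Psychol., 38, 1947, 129-148.

27. Tuddenham, R. D., Soldier intelligence in World Wars I and II, Amer.
Psychologist, 3, 1948, 54-56.

28. Wechsler, David, Wechsler Adult Intelligence Scale, New York,
Psychological Corp., 1955.

29. Wechsler, David, Wechsler Intelligence Scale for Children, New York,
Psychological Corp., 1949.

30. Wheeler, L. R., A comparative study of the intelligence of East Tennessee
mountain children, J. educ. Psychol., 33, 1942, 321-334.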

SUGGESTED ADDITIONAL READING 
Bayley, Nancy. On .he grotvrh o, inlell.gence, drncr. pryCnlnyl.t. .0, 

— P C W. «n. 

CrSVc?"L” l’;:’S-« n, ,.,Wn, 2nd ed.. New Vo, , 

Harper, >S™. Comparative piyel^lif^' 

“TNcgroc^andT^il” m^he ui'sialer Piycltnl. S»».. 3T, 5, 1960, 
Heif Ke“„^ne.K Walter, e. al., I«;^- nod cdiiirn, dt.crencer, ChicaRO. 

University ot ChtcaRo Press, I 3rd ed . Chicago, 

’’'Mince Sse°a'rch rerrorcli, 3rd ed.. 

"“S Z vUSi I.«rNefv‘ VoVk,- Springer, 

Miner. John B-, IntflUs^’^ce m rfte vnue 
1957. 

QUESTIONS FOR DISCUSSION
B has been proposed ^a, nlHn.ell.^- f ™“ S'oy.i'.bil 
- ^::::::;o^=Eendo..;s-^»^ 

telligence test or an m tv. ^ i„p=dimen,. 

h! ?ou - :S; .. edocatlonal 

d. You are making a study of the Mexican children in a school system
in Arizona.

e. You are working with a group of delinquents in an institution.




4. I-hichof.hefono.ing—s ^ 

Animr Point Scale rather than the Slanford-Bmel. wny 

1 For tesung Puerto Rican ehiWrer. ff ’"8 

b. For selecting children for a special f 

c. For evaluating intelligence in a school for »'= ““■■ 

d. For studying children who have reading problems. 

5 What are the implications for child placement agencies of the data on 

\^7d??wo dmer^nfillligence tests given to the same pupil quite 

f",t=ra“:up tests more useM for guidance for 

professional occupations or for skilled oecupations, VWy. 

8. A news article reported that a young woman "ho Jiad been 

ted to a mental hospital with an IQ of 62 had ““e “ ra«e heM^^ 
to 118 during the 3 years she had spent there. be- 

this news statement? ^Vhat factors could account for the diHerencc 

tween the two IQ’s? . ^r-hnnX crades 

9. In what respects are intelligence tests better than high-school grades
as predictors of college success? In what respects are 

10. Why do intelligence tests show higher correlations with standardized
achievement tests than they do with school grades?

11. Comment on the statement: “College admissions officers *” • 

count scholastic aptitude test scores of applicants who come from i 
economic groups.” _ , sntelU- 

12. You are a fourth-grade teacher. You have given a group 
cence test to your class and gotten IQ's from it. 'Vhal additional m 
tion would you want to have on the pupils. What sorts of speem 
and plans might grow out of the test results? 

13. An eighth grader has received the following IQ’s on the ^ 

Thorndike Intelligence Test, Verbal: Grade 4 — 98, Grade 6 11 • . .. 

g — 102. \\'hat would be the best figure to represent his “true scno 
aptitude? . • . jij. 

14. A school in a prosperous community gave Stanford-Binet intelligence
tests to all entering kindergartners and all first graders who had not
been tested in kindergarten within the first week or two of school. How
desirable and useful a procedure is this? Why?



Chapter 10 


The Measurement of Special 
Aptitudes 


The tests that we reviewed in Chapter 9 were tests of general mental
ability. In most cases they resulted in a single score that represented
an over-all appraisal of the individual's ability to deal with abstract
ideas and relationships. However, we found that some of them did
produce two or more scores of a more specialized nature that were
designed to provide more specific and analytical information about
the individual, e.g., the verbal and performance IQ's of the Wechsler
scales. The concern for specific information on more restricted segments
of the ability domain has led to the development of test batteries
and single tests to measure specialized aptitudes. It is these tests
that we shall consider in the present chapter. We will direct our attention
first to batteries and tests designed for vocational guidance and
vocational selection. Then we will consider specialized tests for prognosis
and prediction in special school subjects and in special types of
schools. Finally, we will take a brief look at tests in the specialized
fields of art and music.

VOCATIONAL APTITUDE BATTERIES AND TESTS

One of the early practical concerns of psychologists was in guiding
young people into the types of work in which they would be happy
and successful and in selecting for an employer those men who would
be efficient and satisfied in the jobs that he was trying to fill. As
psychologists began to study jobs, it seemed apparent that different ones
required different special abilities as well as different levels of general
mental ability. The automotive mechanic required a good deal of
mechanical knowledge, but little verbal fluency, while the lawyer
needed verbal comprehension but not mechanical skill. The bookkeeper
needed good ability with numbers, while the watchmaker
needed fine coordination in his finger movements. The ability
requirements of jobs appeared to differ along a number of specialized
dimensions.

At the same time, research demonstrated that human abilities are
to some degree specialized. This has been shown in studies of the
correlations between different tests. Consider the correlations shown
in Table 10.1 between six tests of a battery used for classification of


Table 10.1. Intercorrelations of Selected Air Force Aptitude Tests

                                1     2     3     4     5     6
1. Reading Comprehension             .50   .05   .23   .13   .11
2. Navigator Information       .50         .16   .25   .17   .15
3. Numerical Operations        .05   .16         .44   .27   .11
4. Dial and Table Reading      .23   .25   .44         .39   .23
5. Speed of Identification     .13   .17   .27   .39         .43
6. Spatial Orientation         .11   .15   .11   .23   .43

men in the U. S. Air Force.® Note that the correlations between the 
first two tests are relatively high. These are both tests that are quite
verbal in nature and they appear to define a factor of ability to deal
with verbal relationships. Tests 3 and 4 are both numerical tests and
are substantially correlated. Tests 5 and 6, which correlate substantially
with each other, both involve speed of visual perception. Note
that the correlations of tests 1 and 2 with 3 through 6 are quite low.
The verbal tests are measuring abilities quite different from those
measured by the other four. The numerical and perceptual tests are
not as clearly distinct from one another, but the correlations of tests
3 and 4 with 5 and 6 are less than the intercorrelation of 3 and 4 or
the intercorrelation of 5 and 6. Thus, it appears that our six tests
measure three somewhat distinct abilities: a verbal ability measured
by 1 and 2, a numerical ability measured by 3 and 4, and a perceptual
ability measured by 5 and 6. These abilities are not entirely independent
but are tied together, perhaps by a common element of general
mental ability running through all of them. However, the three
are sufficiently different to justify separate measurement of them.
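
The reading of Table 10.1 given above can be checked with a few lines of arithmetic. The sketch below is an illustration only: it takes the fifteen correlations from the table, adopts the three pairs of tests named in the text as an assumed grouping, and compares the correlation within each pair with the average correlation between pairs. The function and variable names are hypothetical.

```python
from itertools import combinations, product
from statistics import mean

# Correlations from Table 10.1, keyed by (lower test number, higher test number).
R = {
    (1, 2): .50, (1, 3): .05, (1, 4): .23, (1, 5): .13, (1, 6): .11,
    (2, 3): .16, (2, 4): .25, (2, 5): .17, (2, 6): .15,
    (3, 4): .44, (3, 5): .27, (3, 6): .11,
    (4, 5): .39, (4, 6): .23,
    (5, 6): .43,
}

# Grouping proposed in the text (an assumption of this sketch, not a computed result).
CLUSTERS = {"verbal": (1, 2), "numerical": (3, 4), "perceptual": (5, 6)}


def r(i, j):
    """Correlation between tests i and j, in either order."""
    return R[(min(i, j), max(i, j))]


def within(name):
    """Average correlation among the tests of one cluster."""
    return mean(r(i, j) for i, j in combinations(CLUSTERS[name], 2))


def between(a, b):
    """Average correlation of the tests in cluster a with those in cluster b."""
    return mean(r(i, j) for i, j in product(CLUSTERS[a], CLUSTERS[b]))


for name in CLUSTERS:
    print(f"within {name}: {within(name):.2f}")
for a, b in combinations(CLUSTERS, 2):
    print(f"{a} vs {b}: {between(a, b):.2f}")
```

Every within-pair value is larger than any between-pair average, and the numerical and perceptual pairs are closer to each other than either is to the verbal pair, which is the pattern described in the text.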

There has been a large volume of research on the organization and
structure of human abilities during the last 50 years. Much of it has
employed a technique known as factor analysis to try to tease out the
underlying mental factors. Factor analysis starts with a table of
correlations such as we have shown in Table 10.1 (usually, however, a
much larger table) and tries to identify the pattern of underlying factors
that could have produced the observed relationships. The techniques
are computationally laborious and statistically involved, and
we shall not go into them in any detail here.* We shall report merely
that the research has indicated that one can distinguish quite a number
of special ability factors, such as verbal comprehension, word fluency,
numerical fluency, perceptual speed, mechanical knowledge, spatial
visualizing, and inductive and deductive reasoning. It is also true that
most of these abilities are to some degree related to each other. The
tests of general intelligence discussed in the last chapter reflect a pooling
of several of these separate factors, together with accentuation of
their common core.
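
As a rough indication of what "starting with a table of correlations" involves, the sketch below extracts the first principal component of Table 10.1 by power iteration. This is a simplified stand-in chosen for brevity, not factor analysis proper and not the procedure used in the research cited; the function name and the number of iterations are assumptions of the example.

```python
# Table 10.1 written out as a full symmetric matrix with 1.00 on the diagonal.
R = [
    [1.00, .50, .05, .23, .13, .11],
    [.50, 1.00, .16, .25, .17, .15],
    [.05, .16, 1.00, .44, .27, .11],
    [.23, .25, .44, 1.00, .39, .23],
    [.13, .17, .27, .39, 1.00, .43],
    [.11, .15, .11, .23, .43, 1.00],
]


def first_component(matrix, iterations=200):
    """Return (eigenvalue, loadings) of the largest principal component.

    Power iteration: multiply a trial vector by the matrix repeatedly and
    rescale it to unit length; it converges on the dominant eigenvector.
    """
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    # Loadings are the eigenvector scaled by the square root of the eigenvalue.
    loadings = [x * eigenvalue ** 0.5 for x in v]
    return eigenvalue, loadings


eigenvalue, loadings = first_component(R)
print(f"largest eigenvalue: {eigenvalue:.2f}")
for number, loading in enumerate(loadings, start=1):
    print(f"test {number}: loading {loading:.2f}")
```

Because every correlation in the table is positive, all six tests load positively on this first component, which is one way of expressing the common element running through them. Separating the verbal, numerical, and perceptual groupings requires further components or a rotation, which this sketch does not attempt.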

Through theoretical research on the nature of abilities on the one
hand and the applied research on the validity of specific tests for specific
jobs on the other, psychologists have been guided in the design
of aptitude test batteries for use in educational and vocational guidance
and in personnel selection and classification. Since about 1940, these
batteries have come to occupy quite central positions in the testing
scene, so we will need to study them in some detail. First, we will
examine two of the most widely used batteries, one oriented primarily
toward school use and the other toward industrial use. Then we will
review some of the evidence on validity and consider the advantages
and limitations of a battery of this sort.

The Differential Aptitude Test Battery. This battery was produced
by the Psychological Corporation in 1947 as a guidance battery for
use at the secondary-school level. Some attention was paid to getting
measures of separate and relatively uncorrelated abilities, but the main
attempt was to get measures that would be meaningful to high school
counselors. As a result, the intercorrelations of the tests, with the
exception of a test of clerical speed and accuracy, are about .50.
However, the reliabilities of the separate tests average about .90 and
are enough higher than the test intercorrelations to assure us that each
test measures abilities somewhat distinct from those measured by the
others. The eight subtests are briefly described and illustrated below.

1. Verbal Reasoning. Items are of the double-analogies type, i.e.,
_____ is to A as B is to _____. Two sets of answer choices are provided and
one must be picked from each set to complete the analogy.


Example

_____ is to wide as thin is to _____
1. store     2. narrow     3. nothing     4. street
A. fat       B. weight     C. man         D. present

* For an introductory exposition of factor analysis, see Guilford, J. P.,
Psychometric Methods, New York, McGraw-Hill Book Co., 1954.




2. Numerical Ability. Consists of numerical problems emphasizing
comprehension rather than simple computational facility.


Example 


[Sample item not legible in this copy. The answer choices run from A
through E, with E being "none of these."]


3. Abstract Reasoning. A series of problem figures establishes a
relationship or sequence, and the examinee must pick the choice that
continues the series.


Example 




A B C D E 


4. Space Relations. A diagram of a flat figure is shown, and the
examinee must visualize and indicate which solid figure or figures could
be produced by folding the flat figure.


Example 


1 0 0 8 0 0 

A B C D E 


5. Mechanical Reasoning. A diagram of a mechanical device or
situation is shown, and the examinee must indicate which choice is
true of the situation.








Example 



6. Clerical Speed and Accuracy. Each item is made up of a number
of combinations of symbols, one of which is underlined. The examinee
must mark the same combination on his answer sheet.


Example

Test Items          Sample of Answer Sheet



7. Language Usage: Spelling. A list of words is given, some of
which are misspelled. The examinee must indicate for each word
whether it is correctly or incorrectly spelled.


Example 

Right Wrong 

U \i 


diflnatc 






8. Language Usage: Sentences. A sentence is given containing
errors in usage or punctuation. The sentence is divided
into subsections, and the examinee must indicate all the subsections that
contain an error.


Example

Ain't we / going to the / office / next week / at all.
    A           B            C          D         E


Sample of Answer Sheet


A B C D E


The tests of the DAT are essentially power tests, with the exception
of the Clerical Speed and Accuracy Test, and time limits are in most
cases 30 minutes. Total testing time for the battery is about 5 to 5 1/2
hours, and the battery requires at least two separate testing sessions.
Norms are available for each grade from the eighth through the twelfth.
Norms are provided for each of the subtests, and also for the combination
of VR and NA, which may be used as a general appraisal of scholastic
aptitude. An illustration of the profile form on which results
may be plotted is shown on p. 152.
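
The argument made earlier for the DAT, that reliabilities near .90 are "enough higher" than intercorrelations near .50, can be put into figures with two standard psychometric formulas. Neither formula is stated in this passage; they are brought in here as an illustration, and the function names are hypothetical. The first corrects a correlation for the unreliability of both tests; the second gives the reliability of the difference between two scores expressed on the same scale.

```python
def true_score_correlation(r_xy, r_xx, r_yy):
    """Correlation corrected for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / (r_xx * r_yy) ** 0.5


def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference between two equally scaled scores."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Round figures quoted in the text for the DAT subtests.
reliability = 0.90
intercorrelation = 0.50

print(round(true_score_correlation(intercorrelation, reliability, reliability), 2))  # 0.56
print(round(difference_reliability(reliability, reliability, intercorrelation), 2))  # 0.8
```

With these round values, two subtests would correlate only about .56 even if both were perfectly reliable, and the difference between a pupil's two scores would have a reliability of about .80. Both results are consistent with the claim that each test measures something partly distinct from the others.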


The General Aptitude Test Battery (GATB). The General Aptitude
Test Battery was produced by the Bureau of Employment Security,
U. S. Department of Labor, in the early 1940's. It was based
upon previous work in which experimental test batteries had been
prepared for each of a number of different jobs. Analysis of the more
than 50 different tests that had been prepared for specific jobs indicated
that there was a great deal of overlapping among certain ones of
them, and that only about 10 different ability factors were measured
by the complete set of tests. The GATB was developed to provide
measures of these different factors. In its most recent form it includes
12 tests and gives scores for 9 different factors. One is a factor of general
mental ability (G), resulting from scores on three tests (Vocabulary,
Arithmetic Reasoning, and Three-Dimensional Space) that are
also scored for more specialized factors. The other factors, and the
tests that contribute to each, are described below.





Verbal Aptitude. Score is based on one test, Number 4, Vocabulary.
This test requires the subject to identify the pair of words in a set of
four that are either synonyms or antonyms.

Examples

a. cautious b. friendly c. hostile d. remote

a. hasten b. deprive c. expedite d. disprove

Numerical Ability. The appraisal of this aptitude is based upon two
tests. The first of these, Number 2, Computation, involves speed and
accuracy in simple computations with whole numbers.

Examples

Subtract (-)  256          Multiply (x)  57
               82                         8

The second test entering into the Numerical Ability score, Number 6,
Arithmetic Reasoning, involves verbally stated quantitative problems.

Example

John works for $1.20 an hour. How much is his pay for a 35-hour
week?

Spatial Aptitude. One test, Number 3, Three-Dimensional Space,
enters into appraisal of this aptitude. The examinee must indicate
which of four 3-dimensional figures can be produced by folding a flat
sheet of specified shape, with creases at indicated points.

Example 

Example of Spabal Aptitude 

A B C D 

Form Perception. This aptitude involves rapid and accurate perception
of visual forms and patterns. It is appraised in the GATB by
two tests, Number 5, Tool Matching, and Number 7, Form Matching,
which differ in the type of visual stimulus provided. Each requires
the examinee to find from among a set of answer choices the one that
is identical with the stimulus form.

Examples 

Tool Matching: 

A B C D 


Form Matching: 


Clerical Perception. This aptitude also involves rapid and accurate
perception, but in this case the stimulus material is linguistic instead
of purely spatial. The test, Number 1, Name Comparison, presents
pairs of names and requires the examinee to indicate whether the two
members of the pair are identical, or whether they differ in some detail.

Examples 

John Goldstein & Co. - John Goldston & Co.

Pewee Mfg. Co. - Pewee Mfg. Co.

Motor Coordination. This factor has to do with speed of simple but
fairly precise motor response. It is evaluated by one test, Number 8,
Mark Making. The task of the examinee is to make three pencil marks
within each of a series of boxes on the answer sheet to yield a simple
design. The result appears approximately as follows:

0 0 0 0 

Score is the number of boxes correctly filled in a 60-second test period.








Manual Dexterity. This factor involves speed and accuracy of fairly
gross hand movements. It is evaluated by two pegboard tests, Number
9, Place, and Number 10, Turn. In the first of these, the examinee uses
both hands to move a series of pegs from one set of holes in a pegboard
to another. In the second test, the examinee uses his preferred
hand to pick a peg up from the board, rotate it through 180°, and
reinsert the other end of the peg in the hole. Three trials are given for
each of these tests, and score is the total number of pegs moved or
turned.

Finger Dexterity. This factor represents a finer type of dexterity
than that covered by the previous factor, calling for more precise finger
manipulations. Two tests, Number 11, Assemble, and Number 12,
Disassemble, use the same piece of equipment. This is a board with
50 holes in each of two sections. Each hole in one section is occupied
by a small rivet. A stack of washers is piled on a spindle. During
Assemble, the examinee picks up a rivet with one hand, a washer with
the other, puts the washer on the rivet, and places the assembly in the
corresponding hole in the unoccupied part of the board. He assembles
as many rivets and washers as he can in 90 seconds. During
Disassemble, he removes the assembly, returns the washer to its stack, and
returns the rivet to its original place. Score is the number of items
assembled or disassembled as the case may be. The apparatus tests
are all arranged so that at the completion of testing the equipment has
been returned to its original condition, and is ready for the testing of
another person.

A comparison of the GATB and the DAT brings out that the DAT
has tests of mechanical comprehension and language which the GATB
lacks, while the GATB includes form perception and several types of
motor tests that are missing in the DAT. Thus the GATB is more
work oriented and less school oriented in its total coverage. Inclusion
of the several types of motor tests results in somewhat lower correlations,
on the average, for the GATB, though the "intellectual" tests
correlate about as highly as those of the DAT. The correlations
among the different aptitude scores of the GATB are shown in Table
10.2 for a group of 100 high school seniors. Excluding the correlations
with G, which involves the same tests appearing in V, N, and
S, the correlations range from -.06 to .66. The three motor factors
show fairly marked correlations with one another, but they are practically
unrelated to the remaining tests. The perceptual and intellectual tests also show
quite a bit of relationship to one another, and this is most marked between
the two types of perceptual tests.





Table 10.2. Intercorrelations of GATB Aptitude Scores for
100 High-School Seniors *

                        G     V     N     S     P     Q
G-Intelligence
V-Verbal               73
N-Numerical            74    42
S-Spatial              70    40    34
P-Form Percept.        43    34    42    48
Q-Clerical Percept.    35    29    42    26    66
K-Motor Coord.        -04    13    06   -03    29    29
F-Finger Dext.        -05   -03   -03    01    27    20
M-Manual Dext.        -06    06    01   -03    23    16


* Decimal points have been omitted.
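
The pattern in Table 10.2 is easier to see when the entries are averaged by type of factor. The sketch below is an illustration built only on the columns V through Q as they are legible in this copy; the correlations with G are left out because G shares tests with V, N, and S, and the correlations among the three motor factors themselves are not available here. The grouping into intellectual, perceptual, and motor factors and all names in the code are assumptions of the example.

```python
from itertools import combinations
from statistics import mean

# Lower triangle of Table 10.2 without the G column (decimal points omitted).
LOWER = {
    "N": {"V": 42},
    "S": {"V": 40, "N": 34},
    "P": {"V": 34, "N": 42, "S": 48},
    "Q": {"V": 29, "N": 42, "S": 26, "P": 66},
    "K": {"V": 13, "N": 6, "S": -3, "P": 29, "Q": 29},
    "F": {"V": -3, "N": -3, "S": 1, "P": 27, "Q": 20},
    "M": {"V": 6, "N": 1, "S": -3, "P": 23, "Q": 16},
}

INTELLECTUAL = ("V", "N", "S")
PERCEPTUAL = ("P", "Q")
MOTOR = ("K", "F", "M")


def r(a, b):
    """Look up a correlation in either order and restore the decimal point."""
    if b in LOWER.get(a, {}):
        return LOWER[a][b] / 100
    return LOWER[b][a] / 100


def within_mean(group):
    """Average correlation among the factors of one group."""
    return mean(r(a, b) for a, b in combinations(group, 2))


def cross_mean(rows, cols):
    """Average correlation between the factors of two different groups."""
    return mean(r(a, b) for a in rows for b in cols)


print(f"intellectual with intellectual: {within_mean(INTELLECTUAL):.2f}")
print(f"perceptual with perceptual:     {within_mean(PERCEPTUAL):.2f}")
print(f"perceptual with intellectual:   {cross_mean(PERCEPTUAL, INTELLECTUAL):.2f}")
print(f"motor with intellectual:        {cross_mean(MOTOR, INTELLECTUAL):.2f}")
print(f"motor with perceptual:          {cross_mean(MOTOR, PERCEPTUAL):.2f}")
```

On these figures the motor factors are essentially uncorrelated with the intellectual factors and only modestly related to the perceptual ones, while the two perceptual factors show the single highest correlation in the table.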


There are quite substantial correlations between the corresponding
factors of the DAT and the GATB. Representative values from one
study • are as follows: 


Verbal       .70
Numerical    .56
Space        .69
Clerical     .56


However, the correlations are low enough so that it is clear that the
tests cannot be considered identical. One important difference is the
fact that the DAT tests are in most cases purely power tests, while the
GATB tests are quite highly speeded.

Other Aptitude Batteries. A number of other aptitude batteries 
have been produced, mostly since 1950. There is generally less infor- 
mation available on these than on the DAT or the GATB, so their 
usefulness is less fully established. The batteries are briefly described
in Appendix III.*

There are also a good many single aptitude tests. Many of these
are much like the tests that have been described as components of the
DAT or GATB. The batteries have, of course, usually adapted ideas
from the most effective single tests and incorporated measures that
have been successful in previous use. Thus, the Bennett Mechanical
Comprehension Test was the predecessor and model for the DAT
Mechanical Reasoning Test. The Minnesota Vocational Test for


reports on each of seven different batteries, together with an evalua* 
« the Personnel and Guidance Journal 

separate monograph entitled The Use of Multifactor Tests in Guidance.




in the GATB. The various early mechanical aptitude and clerical tests
have been reviewed by Bennett and Cruikshank, and of course more
recent tests will be found reviewed in the Mental Measurements
Yearbooks.


VALIDITY OF APTITUDE BATTERIES 

Now we must inquire into the usefulness of aptitude batteries such 
as the DAT and the GATB. We must inquire to what extent such a 
battery can provide us information that permits us to make better, 
more varied, and more differentiated predictions than those that are
possible from a test of general mental ability or scholastic aptitude. 
The types of predictions with which we are most likely to be con- 
cerned are predictions of success in specific school subjects or major 
fields, predictions of success in specific jobs for which the individual is 
an applicant, and predictions of success in general fields of the world 
of work. 

Differential Prediction of Academic Success. We have seen that 
scholastic aptitude tests have fairly good over-all validity for predict- 
ing academic success. One thing that we might hope is that an apti- 
tude battery would tell us in which subject areas a student is most 
likely to be successful. Will Walter do better in English or in mathematics,
in science or in French, in mechanical drawing or in history?
A battery can do this to the extent that different tests in the battery
are valid for different subjects. To what extent is this the case? 

The manual for the DAT provides extensive data on the correlations
of each of the subtests with achievement in a number of school subjects.
Some of these results are summarized in Table 10.3. This


Table 10.3. Median Correlation of Differential Aptitude Test Scores with
School Grades in Different Subjects

[The entries of this table are not legible in this copy. Rows: Verbal
Reasoning (VR), Numerical Ability (NA), Abstract Reasoning (AR), Space
Relations (SR), Mechanical Reasoning (MR), Clerical Speed and Accuracy
(CSA), Spelling, Sentences. Columns include English, Mathematics,
Science, Social Studies and History, Typing, and Shorthand. Each entry
gives a median correlation, followed in parentheses by the rank of that
test for that subject.]



table shows the median value of the correlations and also ranks the
subtests with respect to their correlations with each subject.

The first thing that we notice is that certain subtests rank
highest for almost all the subjects. Thus, Verbal Reasoning is near
the top for all subjects except typing and shorthand. The Sentences
test is one of the best predictors for all subject areas. This means
that in large part the abilities that underlie academic performance
are general abilities, and that a general scholastic aptitude test will
be effective as a predictor. The authors of the DAT have recognized
this by combining the Verbal Reasoning and Numerical Ability tests
into a single score and preparing separate norms for it. The
combination of these tests provides an effective measure of general
scholastic aptitude.

At the same time, Table 10.3 does show some indication of differential
validity. The Mechanical Reasoning Test is more valid for
science than for the other subjects. The Spelling Test comes into its
own in predicting success in shorthand. The Numerical Ability Test
is more valid for mathematics than it is for English. The battery
does have a modest amount of differential validity, and does provide
some suggestion that a pupil is likely to be more successful in one
field than in another. However, it must be admitted that for
educational guidance a general measure of scholastic aptitude will
prove quite serviceable, and a battery of specialized aptitude tests will
make only a limited additional contribution.

Prediction of Specific Job Success. We may next ask how successful
a battery of aptitude tests will be in predicting the success of workers
in a specific job in a specific company. Will the tests have validities
high enough to make them useful to employers? Will different
tests predict success in different jobs? The manual for the GATB
provides quite an array of validities for job criteria. The data fall
short of being ideal because the validation is often concurrent, based
upon men already employed; because the samples are small; because
the sample is typically limited to workers in a single plant or company;
and because there is rarely any independent cross-validation.* However,
they provide about as good a pool of data as we have in which
a common battery was validated against criteria of success in a number
of different jobs. We have abstracted from the original report
those instances in which validities are available against job (as distinct
from school or training) criteria for samples of as many as seventy

* Especially in exploratory studies in which a battery of tests is being tried
out, it is important to verify validities discovered in an initial study by checking
the same tests with a new independent sample.




cases and display them in Table 10.4. Only those correlations are
shown in the table that are of a size that would be unlikely to have
occurred by chance.*
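
Why small samples without cross-validation are a weakness can be shown with a small simulation. The sketch below is an illustration only and uses no data from the GATB manual: it assumes a battery of 12 tests, none of which has any real relation to the criterion, tried out on 50 workers. In each trial the test with the highest correlation in the first sample is picked out and then checked on a fresh sample. All names, sample sizes, and the random seed are assumptions of the example.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def shrinkage_demo(n_workers=50, n_tests=12, n_trials=200, seed=0):
    """Average validity of the 'best' test in the tryout and on cross-validation."""
    rng = random.Random(seed)

    def draw_sample():
        # Criterion and test scores are independent, so true validity is zero.
        criterion = [rng.gauss(0, 1) for _ in range(n_workers)]
        tests = [[rng.gauss(0, 1) for _ in range(n_workers)] for _ in range(n_tests)]
        return criterion, tests

    tryout, crossval = [], []
    for _ in range(n_trials):
        criterion_a, tests_a = draw_sample()
        criterion_b, tests_b = draw_sample()
        validities = [pearson_r(t, criterion_a) for t in tests_a]
        best = max(range(n_tests), key=lambda i: validities[i])
        tryout.append(validities[best])
        crossval.append(pearson_r(tests_b[best], criterion_b))
    return mean(tryout), mean(crossval)


tryout_validity, crossval_validity = shrinkage_demo()
print(f"average validity of the best test in the tryout sample: {tryout_validity:.2f}")
print(f"average validity of the same test in a new sample:      {crossval_validity:.2f}")
```

The test that looks best in the tryout owes its standing entirely to chance, and its apparent validity falls back toward zero in the second sample. This is the reason for checking validities on a new independent sample and for reporting only correlations larger than chance would readily produce.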

Table 10.4. Validity of GATB Scores for Specific Occupations

[The entries of this table are not legible in this copy. The occupations
that can be made out include Machinist, Mounter, Pottery Decorator,
Sewing Machine Operator, Telephone Operator, and Underwriter.]

encouraging values “f' (o, different jobs differ. Thus, 

clear that the factors important for a number of ““'"'“y 

manual dexterity appears to P ^^,^,i„ly heavily for 

and production line perception is relevant o 

?he riachinlst and chemist's 4“''™'; general intelligence dis- 

the printing trades and to "^""“Wr.etc. In so tar as thes 

criminates the good from the p«rs^^J^__^^ '"to 

sincle companies are pre 

lecdon of aptitiKie of aptitude tests in relation 

" .argu uathi^ny ris.i«-. »■ - - -■ 

* Correlations are exhibneo 





Selections from his report are shown in Table 10.5. 


Table 10.5. Selected Data on Average Validity of Different Sorts of Tests
for Different Categories of Job (Adapted from Ghiselli)





Type of lob 


_ 






Trades 


Type of Test 

Super- 

visory 

Cleri- 

cal 

Sales 

Pro- Vehicle 
teciive Operator 

and 

Crafts 

Indus- 

trial 

Intelligence 

Arithmetic 

28 

20 

31 

26 

02 

(06) • 

27 H 

(15) (04) 

20 

23 

20 

13 

Spatial 

Relations 

21 

10 


(11) 

19 

14 

Name 

Comparison 

30 

(-15) 

(24) 

(20) 

16 

Mechanical 

Principles 

24 



(27) 21 

40 

(50) 

Finger 

Dexterity 


24 


(19) 

20 

18 

Arm 


(18) 



15 

21 

Dexterity 







* Correlations based on less than 500 cases are placed in parentheses.


an average, often of a number of correlations. The correlations have
been enclosed in parentheses when they are based on less than five
hundred persons. For some combinations of test and occupation no
data could be found, so these entries have been left blank.

The pooled correlations reported by Ghiselli rarely go a ove 
and then only for the smaller groups. Correlations in the .20's are
fairly typical. For a given category of job, the variation in validity
from one type of test to another is rather modest. Thus, these results
present a rather less optimistic picture of the value of tests of special
aptitudes than that portrayed in the GATB results in Table 10.4.

The less promising picture may stem in part from the 
suiting from combining quite a span both of jobs and of tests 
a single cocfricicnl. It may be, however, that the larger 
cases represented in Ghisclli’s composite correlations are less li 'C > 




>"rri;rxs.-;==-; 

*SS=;==="“ 

ness of aptitude test „„ ,his problem is one 

Probably the most ^ j oridnally been given an 

in which t'PP™’''''"f '5'.'°;““ ™i„' ,hc Air Force during World War 
extensive battery ot =P"‘“f te time of testing." Test 
II, were follotved up some f in an occupation and 

results were related to entry ^ in ibat occupation. 

SL=;»9rS;»eontras^^ 

shaTJly wifh'‘those'’reported 

convincing evidence of oiiy „,„d a speeiBe occupaMn. 

an occupation for those , 7 " »''> ™„,'as often negative as positive 
Correlations were J quite possibly have arisen as a 

ro^ie'’rrJ and quite ^ 

the table shows data for «»= 




is set equal to 100. Thus, a score of +50 lies one half of a
standard deviation above the cadet mean. The profiles for
several of the occupations are shown graphically in Figure 10.1.
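
The score scale described here, with the cadet group as the reference, a mean of zero, and a standard deviation of 100, can be sketched as follows. The reference scores in the example are hypothetical numbers chosen to make the arithmetic easy to follow; they are not data from the study, and the function name is an assumption.

```python
from statistics import mean, pstdev


def to_reference_scale(raw_scores, reference_scores):
    """Express raw scores in units where the reference group has mean 0, SD 100."""
    center = mean(reference_scores)
    spread = pstdev(reference_scores)
    return [100 * (score - center) / spread for score in raw_scores]


# Hypothetical reference group with mean 50 and standard deviation 10.
reference = [40, 60]

print(to_reference_scale([55, 50, 43], reference))  # [50.0, 0.0, -70.0]
```

A raw score of 55 in this made-up case is half a standard deviation above the reference mean and therefore becomes +50, which is how the entries of Table 10.6 are to be read.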

From Table 10.6 and Figure 10.1 we can see that there are substantial
differences between one occupation and another, and that
these make good sense. The accountants as a group are highest on
numerical ability, while the architects are highest on visual perception.
Engineers are highest on the general intellectual abilities that bulk
large in success in engineering school, while machinists are highest on
mechanical and psychomotor skills. Some profiles show marked
peaks and hollows, as, for example, that of the machinist.
Others are quite flat, exemplified by the sales engineer and


[Figure: bar profiles on the General, Number, Visual, Mechanical, and
Motor scores. The panel titles legible in this copy are Accountant,
Architect, and Machinist.]

Fig. 10.1. Ability profiles for four occupations.




Table 10.6. Ability Profiles of Occupational Groups *

Group


Accountants and 
auditors 28 
Architects 
Artists and 
designers 
Bricklayers 
Carpenters 
College 
professors 
Contractors 
Dentists 
Draftsmen 1 

Drivers, bus 

and truck -53 

Engineers. 

chemical lOo 

Engineers, 

civil "^5 

Engineers, 
sales 57 

Farmers, ^ 

general 


44 

- 7 
-24 
-44 

75 

- 7 


54 


- 5 
-17 

38 

-10 

20 

-14 

-11 

42 

31 

33 


Lawyers 
Machinists 
Mach, ops., 
fabricating 

39 

-35 

-45 

-’6 

-25 

Managers. 

credit 

- 5 

22 

Managers. 

office 

Mechanics, 

vehicular 

Physicians 

4 

-72 

59 

33 

-65 

20 

-21 

Plumbers 

-42 



- 4 
74 


51 

-38 


3S 

-10 

15 

31 

-23 

30 

56 

35 

-29 

- 7 
4 

-25 

25 


10 

24 

-33 

34 

19 

14 

-14 

19 

36 

39 

38 

-42 

31 

-39 

-27 


-32 
- 1 

1 

5 

I 

15 

-20 

20 

14 

40 

-36 

-21 

32 

9 


plumber profiles.

pencnrlly high »"> 

liccably both in Ihc Itrc' " 

and type p' spwWifP"™' 




The differences shown in Table 10.6 between occupations represent
a minimum estimate of the differences that exist. This is because
the Air Force group had been screened by a test of general ability,
so that the less intellectually able men who would ordinarily have
been heavily represented in the semiskilled and unskilled occupations
had been screened out. On the other hand, it must always be
remembered that, whatever differences may be shown between group
means, there is still a wide range of scores within each group. Some
artists will be found with higher number ability than the typical
accountant, and some men in other occupations with higher mechanical
ability than the typical engineer. Differences between occupations are
real, but so is variability within occupations.
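
The overlap between groups can be made concrete with a small calculation. The sketch below is an illustration only, under stated assumptions: scores within each group are taken to be normally distributed with a common standard deviation, and the two means and the standard deviation are hypothetical values chosen for the example, not figures taken from Table 10.6.

```python
from statistics import NormalDist


def share_above_other_mean(own_mean, other_mean, sd):
    """Fraction of a group scoring above the mean of another group.

    Assumes (for illustration) normally distributed scores with the
    same within-group standard deviation `sd` in both groups.
    """
    return 1.0 - NormalDist(mu=own_mean, sigma=sd).cdf(other_mean)


# Hypothetical values: group means 50 points apart, within-group SD of 100.
lower_mean, higher_mean, within_sd = 0.0, 50.0, 100.0
share = share_above_other_mean(lower_mean, higher_mean, within_sd)
print(f"{share:.2f} of the lower group exceed the higher group's mean")
```

With means half a standard deviation apart, roughly three in ten members of the lower group still score above the typical member of the higher group, which is the sense in which variability within occupations is as real as the difference between them.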

PROGNOSTIC TESTS 

One group of aptitude tests is made up of tests designed to predict
readiness to learn or probable degree of success in some particular
subject or segment of education. These are called prognostic tests. One
group of tests in this category that have been widely developed and
have received considerable use are the "reading readiness" tests. These
tests are designed to be used with children, usually at the time of
their entry into the first grade, to give the school as accurate an
indication as possible of the child's ability to progress in reading.
They also provide information the teacher can use in assembling groups
within the class, in deciding upon the amount and type of prereading
activities to provide, and in judging how soon to start a reading
program. In some communities where kindergarten attendance is
quite general, tests at the end of kindergarten are looked to as a
basis for organizing first-grade groups for the following year. The
sorts of tasks that appear in these tests may be seen from Table 10.7.

The reader who compares the tasks in Table 10.7 with the
intelligence test items shown on pp. 222-225 will be aware of a
substantial degree of similarity. In both, knowledge of word meanings
appears. Both deal with recognition of sameness and difference and
with analysis and classification. However, the reading readiness tests
tend to emphasize more exclusively the materials of reading, letters
and words. They include the components or early stages of the
reading task. The basic question now becomes: Does the specialization
which is given in the reading readiness test result in increased
validity? Is the special test an improvement over a measure of general
or academic aptitude? This is the question that must be raised for any
type of prognostic test or special aptitude test.





Table 10.7. Types of Tasks Included in Representative Reading
Readiness Tests

Type of Test Task        Gates   Lee-Clark   Metropolitan   Murphy-Durrell

Oral vocabulary or directions, using pictures
Rhyming or matching sounds
Visual matching of figures, letters, or words
Visual perceiving of figures, letters, or words ("Which one is different?")
Learning words in a standard lesson
Ability to read letters and words

Whether a reading readiness remains somewhat 

reading suceess than does a ■ indicates that tests re- 

unclear. One fairly f„ds. 10 eomplele a story, and 

quiring pupils to „dlai„„ of reading achievement 

to select rhyming words ga P sumford-Einet mental age. The 

one, two, or three terms Retinas Test, developed 

validities repotted for the Oof imn/wd- 

on the basis of this rcscarc , study. This 

B/,i« MA showed a ' Jsely resembling the tasks faced 

would indieale that the lest m effectiveness, 

by the beginning reader <in ha™ h -e ,„e 

Another set of data '• J^er correlation with s.iith-giade 

nfnsliam IiMlIlgawe ^ ^^CtorA Reading 

reading achievement than be considered eontradie 

However, these two sets o ndertakes to predict ability to p 

for Tile reading readiness test „,ed to forecast 

fr^ reading instruction m the near t „,s, „ „ more 

Sate level of ryding SSg „'uZ 

efeeuve as an indicator of po^e^ ^ ,„d,eator of 

levels, and the last lew years have 




cial subject area than is possible from a general measure of scholastic
aptitude. However, one may still question whether, given a record
of academic achievement, special prognostic tests can improve
predictions based upon a combination of measures of general
intelligence and previous academic achievement sufficiently to justify
their use. The demonstration that they can has not been sufficiently
impressive to result in widespread adoption of the tests.

Special prognostic tests seem likely to be more useful as predictors
of success in rather special types of academic tasks that have had no
counterparts at earlier levels of school experience. Thus, a
Shorthand Aptitude Test, for which a correlation with
achievement in shorthand has been reported, may be useful as a
supplement to other information about the pupil in estimating probable
success in shorthand training. The ERC Stenographic Aptitude Test
and the Bennett Stenographic Aptitude Tests have given comparable
results. These tests include such tasks as spelling, transcribing
symbols, dictation under speed pressure, and word discrimination.

PROFESSIONAL-SCHOOL APTITUDE BATTERIES 

One other group of aptitude tests, so-called, are the tests that have
been developed to select individuals for particular types of
professional training. Many types of professional schools, sometimes
individually but more often operating through their professional
organizations, have instituted testing programs for the selection of
their students. Such programs are in operation for selecting students
for engineering, medicine, dentistry, veterinary medicine, nursing, and
accounting, to mention a few.

The tests used in these professional-school batteries tend to be tests
of reading, quantitative reasoning, and apprehending abstract
relationships, with the balance and emphasis shifted somewhat to match
the academic emphasis of the particular training program. They are
largely minor variations upon the same theme: a relatively high-level
measure of scholastic aptitude and achievement. The different
professional aptitude tests would correlate very substantially with one
another or with a measure of general intelligence, and, indeed, it
should be expected that they would because the abilities required to
succeed in training for the different professions have much in common.
The similarities outweigh the differences. The common core is
slanted to the professional field, as by giving more emphasis to
quantitative material.

MEASUREMENT OF MUSICAL APTITUDE

. When we come to such fields -"'bradest S" 
measures of aptitude becomes qu by general measures 

of appraising them. is executive or motor, the 

In musical ability one P required for playing an instru- 

ability to master the this domain, perhaps 

ment. Aptitude "’ff 'S„tor iusttument. Most measute- 

mrLlecri:etdm-/.h= per^pU- -d interptetive aspects 

Understanding music involves in the first place sensory
discrimination, such as the discrimination of pitch. It involves in the
second place perceiving the more complex musical relations in musical
material: the pattern of a melody, the composition of a harmony, the
relationship of a harmony to a melody. Third, it involves judgments
about the suitability and pleasingness of a melody, a harmony, a
rhythmic pattern, or a pattern of dynamics.

The most thoroughly investigated musical aptitude test battery, the
Seashore Measures of Musical Talents, is directed primarily toward
measuring simple sensory discriminations, though with some attention
to perceiving relationships. The tests have analyzed music down so far
that very little that is musical remains. Thus, there are the
following subtests:

1. Discrimination of Pitch: judging which of two tones is higher.

2. Discrimination of Loudness: judging which of two sounds is
louder.

3. Discrimination of Time Interval: judging which of two intervals
is longer.

4. Judgment of Rhythm: judging whether two rhythms are the same
or different.

5. Judgment of Timbre: judging which of two tone qualities is more
pleasing.

6. Tonal Memory: judging whether two melodies are the same or
different.

The items are on phonograph records, with a series of items of each
type. Within each type, the judgments become progressively more
difficult.

The analytic approach to musical aptitude is evident in the above
list of subtests. Critics have contended that the analysis has removed
the tests a great distance from any genuinely musical material and that
fine discriminations of pitch, time, and intensity are not really called
for in the activities of the musician. Validity studies of the Seashore
tests have been somewhat conflicting, yielding appreciable correlations
with measures of musical success in some instances and very low
correlations in others. The value of the analytic test is still a
matter of doubt and controversy.

Contrasting rather markedly with the Seashore type of test are the
Wing Standardised Tests of Musical Intelligence. These tests,
developed in England, were designed to stay as close as possible to the
actual materials of music. The following subtests are included:

1. Chord Analysis: detecting the number of notes in a single chord.

2. Pitch Change: detecting the direction of change of one note in a
repeated chord.

3. Memory: detecting which note is changed when a short melodic
phrase is repeated.

4. Rhythmic Accent: judging which of two performances of the
same piece has the better rhythmic pattern.

5. Harmony: judging which of two harmonies is more appropriate
for a melody.

6. Intensity: judging which of two playings of the same piece has the
more appropriate pattern of dynamics.

7. Phrasing: judging which of two renditions has the more
appropriate phrasing.

This test is made up in part of tests that call for perceiving musical
relationships and in part of tests that call for esthetic choices in
musical material. Information on the validity of the test is still
limited, but what there is appears encouraging. If this is maintained
in future studies, a test like the Wing test would appear to have a
very real place in the guidance of young people who have musical
aspirations or whose families hold such aspirations for them.

TESTS OF ARTISTIC APTITUDE 

Severn, types of tests "o"uC^ 
the first place, there have been Judgment Test. Each 

is now fairly well , „w,s One is an aclnowl- 

item consists of a P“ masterpiece systematicaUy 

edged masterpiece, fbe other choose the better 

"n ""'helest Sank indicating the respect in which the 

'™Tetor.L*Sgmen.atas,««o,a^ 

MgmentTest. The members 

consist of abstract and i.e., balance, symmetry, 

judgmental, aspect of art. M ^ „n certain bnuting 

require the subject to P™''““ fnventnry, n pattern o 

"Bivens." Thus, in the material the examinee nmst 

Ites and dots is provided, and io,2. Wc 

Siis^sss 

SIS25Si=»'- 






Fig. 10.2. Example of type of item in Horn Art Aptitude Inventory.


ance is reasonably predictive of later art-school success. Horn
and Smith * found a correlation of .66 between score on the test
at the beginning of the year and average faculty rating of success in
a special high-school art class at the end of the year. Barrett
correlated four art tests with grades in a ninth-grade art course and
with ratings of pupils' art products, with the following results:



                                           Course    Ratings of
                                           Grade     Product

McAdory Art Test                            .10        .13
Meier Art Judgment Test                     .37        .35
Knauber Art Ability Test                    .33        .71
Lewerenz Fundamental Art Abilities Test     .40        .76


Thus the last two tests, requiring production of drawings by the
examinee, had about the same correlation with grades as did the Meier
Art Judgment Test but much higher correlations with appraisals of
student products.





We can see from the above that the test tasks that require art
students to do the sorts of tasks they will be taught to do in art
predict their later achievement. How far down to untrained pupils this
can be pushed remains to be determined.

Since the keying of art tests of all types depends upon a consensus of
judgment, obtaining a high score requires conformity to that consensus.
It is quite possible that a highly talented but unconventional person
would do poorly on the tests.


SUMMARY STATEMENT 
Though general intelligence tests bear some relationship to success
in many fields, efficient prediction calls for tests more closely
matched to the abilities called for by the particular field. Tests of
readiness to undertake particular educational tasks have also been
developed. Most widely used are reading readiness tests. Other
prognostic tests have been less widely used, perhaps because their
function is reasonably well served by measures of scholastic aptitude
and academic achievement. Professional school aptitude tests appear to
be variations upon the basic theme of scholastic aptitude.

The fields of music and art have produced a number of ability tests.
However, highly analytic tests have not been notably successful.
More complex tests involve some admixture of previous training, but
they may provide an improved prediction and hence show more promise
in the field.

REFERENCES

,. Bartett, H. ?.. 

-■ Vo'ik. p.yah*E'”>“'^ - 



3. BcnncH. G. K., and Ru.h M. 1^;“' 


,SS'“5=:isrs»";v- 

’ MI New York. Psychological Corp,. ipS"- . Porecs 

5. D. C.. 

Aviation Psychology Rci^^ ^o. 

U. S. Government Printing J-'„ ' ,, if, ., 1 ,^, 1 % of deiermininR 

- s umvershy. 

versity of California PuHicalions in Psychology. 1955. bl-I. IP 

8. Guide to the use of the general aptitude test battery. Washington,
D. C., Bureau of Employment Security, U. S. Department of Labor.

9. Horn, C. A., and L. F. Smith. The Horn Art Aptitude Inventory. J.
appl. Psychol., 29, 1945, 350-355.

10. Lee, M. J., and W. W. Clark. Lee-Clark Reading Readiness Test:
Manual. Los Angeles, Calif., California Test Bureau.

11. Thorndike, Robert L., and Elizabeth P. Hagen. 10,000 careers. New
York, John Wiley, 1959.

12. Wing, H. Tests of musical ability and appreciation. An investigation
into the measurement, distribution, and development of musical
capacity. Brit. J. Psychol. Monogr. Suppl., 8, No. 27, 1948.


SUGGESTED ADDITIONAL READING 

Froehlich, Clifford P., and K. B. Hoyt. Guidance testing, 3rd ed., Chicago,
Science Research Associates, 1959, Chapter 6.

Harris, Chester W., Editor. Encyclopedia of educational research,
New York, Macmillan, 1960, pp. 59-62, 1084-1085.

Super, Donald E., Appraising vocational fitness by means of psychological
tests. New York, Harper, 1949, Chapters 4, 6, 8-11, and 15.

Super, Donald E., The use of multifactor tests in guidance, Washington,
D. C., American Personnel and Guidance Association, 1958.


QUESTIONS FOR DISCUSSION 

1. A number of aptitude test batteries have been developed for
the secondary-school level, but almost none for the elementary school.
Why is this? Is it a reasonable state of affairs?

2. What are the advantages in using a battery such as the „ 

Aptitude Tests instead of tests selected from a number of different
sources? What are the limitations?

3. Step by step, what would need to be done to set up a program for
selecting students for a dental school?

4. How could a high-school counselor use the data of Table 
What are the limitations on the usefulness of this material? 



5. How might the counselor use the data of Table 10.2? What are its
limitations?

6. How sound is the statement, "The best measure of aptitude in any
field is a measure of achievement in that field to date"? What are its
limitations?

7. What are the differences between a reading readiness test and an
intelligence test? What are the advantages of using the readiness test
rather than an intelligence test for first-grade pupils?

8. To what extent are tests like the Horn Art Test and the Wing Music
Test measures of aptitude? To what extent are they measures of
achievement?

9. What factors tend to make tests of artistic and musical aptitude
somewhat less useful than other types of aptitude tests?

10. In what ways could a follow-up study of graduates of a high school
help in improving the school guidance program?

11. Why have aptitude test batteries shown up better in discriminating
between jobs than in predicting success within a single job category?



Chapter 11

Achievement Tests


STANDARDIZED VERSUS TEACHER-MADE TESTS

We turn now to tests of school achievement. We shall be concerned
particularly with commercially available standardized tests, though
we shall need to consider other types of appraisal devices in order to
see how the standardized test fits into the total picture. As was
indicated in the previous chapter, the distinction between an aptitude
test and an achievement test is a somewhat blurred one. However, we
shall be interested now in measures of knowledges and skills that are
closely tied to organized school instruction, and in measures that are
used primarily to appraise present status in those school-taught
knowledges and skills.

Standardized tests do not represent anything new and strange in the
measurement of academic achievement. They are blood relatives of
the short-answer teacher-made tests that were discussed in an earlier
chapter. They are made up of the same types of items and cover many of
the same areas of knowledge. In what ways, then, do they differ from
teacher-made tests? What are the advantages and limitations of each?
For what purposes should each be used?


DISTINCTIVE FEATURES OF STANDARDIZED TESTS

There are four main ways in which commercially distributed
standardized tests differ from the tests that the individual teacher
would prepare for his own class.

1. The standardized test is based on the general content and
objectives common to many schools the country over, whereas the
teacher's own test can be adapted to content and objectives specific to
his own class.

2. The standardized test deals with large segments of knowledge or
skill, whereas a teacher-made test can be prepared in relation to any
specific limited topic.

3. The standardized test is developed with the help of professional

2. Compare achievement of different skills or in different subject
areas.

3. Evaluate the status of pupils from different schools or classes on
a common basis, as when a pupil transfers to a new school.

4. Make comparisons between different classes and schools.

5. Study pupil growth over a period of time to see whether progress
is more or less rapid than might be expected.

For some purposes, such as pupil diagnosis, we need to use not
only standardized and teacher-made tests but a variety of informal
testing and observational procedures as well.

We see, therefore, that standardized and teacher-made tests both
have important functions to perform in the educational program. To
a large extent they are different functions. The two types of
evaluation supplement one another. They are not competitors.

Standardized tests of achievement have been prepared for practically
every subject in the school curriculum. It would be impossible
to give even a brief treatment of all subject areas in the pages that
can be allotted to achievement tests in this book. We have chosen,
therefore, to concentrate on tests in a single area, in the belief that
a fairly full treatment of this area will serve to introduce almost all
the main problems and issues that would be encountered in dealing with
tests in any area. We have chosen to discuss reading tests for two
reasons. In the first place, these tests are more widely used than
those in any other subject area. In the second place, a great variety
of both survey and diagnostic procedures have been developed in this
area, so that we shall get an introduction to a wide spectrum of
testing techniques. Readers especially interested in tests in other
areas can get critical evaluations of available tests in their field
from Buros' Mental Measurements Yearbooks (see Chapter 8).


ANALYZING THE OBJECTIVES OF AN AREA: 

THE FIRST STEP 

Before we can proceed intelligently with either the preparation of a
test in an area or the choice of one from among those already existing,
we must analyze and define clearly the objectives of our instruction
in the field in question.

Is test A a good test? Good for what? What are we trying to
produce in children? What do we want or need to evaluate? Only as
we have in our minds a clear answer to these further questions can we
answer the question: Is this a good test? An analysis of objectives
helps us to identify the strengths and weaknesses of any single test or
any complete evaluation program.

When we look at our statement of objectives, we will see some that
can readily be appraised by tests and some that seem obviously
inaccessible by tests. Consider the following statement of objectives
in the field of reading.

Essential Knowledge, Attitudes, Skills, and Procedures in Reading

I. Basic attitudes, skills, and procedures involved in securing meaning

1. Accurate and meaning- 

■■HSi-fSaerw-r. 

pJJSn oMUrmSgs ol ssord. inm « chain ol relaled 
, kCnilion ol .he impohance and rclalionship ol .he ideas 

2.Co;S'g“s:cc.sslully»i.hsuchlae.nrsas: 

S;S»ro<-ences.rue.ure. 

c. Abstiact ideas. . ,,1 its broader conlesl. mis 

H. TO .^.Jinuss. and slgnlfieanee 

;:sia, 

s Ihe passage. „ completeness ol .he 

3. Judging the accuracy 

sions. c Cray "The measurement 

. Adapted iron, Greene. K-P-* »' 

o, under'’., andins P«r 1. .«6. 

tional Society far the Siuay t 



4. Recognizing whether or not the reasoning of the author is
sound.
5. Identifying and

F. To integrate the ideas acquired with previous experience so that
the following evidences of understanding may be noted immediately,
or later:
1. New insights are acquired.
2. Previous understandings are reaffirmed or modified.
3. Challenging problems are solved.
4. Rational attitudes are acquired.
5. Behavior is modified.
6. Interests are broadened.
7. Richer and more stable personalities are developed.

II. Supplementary attitudes, skills, and procedures essential in
silent-reading activities.

A. To locate needed information.
1. Using an index.
2. Using a table of contents.
3. Using the dictionary.
4. Using card files.
5. Using reference books.

B. To gather and evaluate information in the light of a given purpose.
1. Recognizing the purpose to be achieved.
2. Applying appropriate fact-finding techniques.
3. Sorting essential from non-essential information.
4. Judging the validity and significance of the information.
5. Organizing the information in terms of the purpose or problem.
6. Drawing tentative conclusions.
7. Deciding when the purpose has been achieved.

C. To adjust reading attitudes and procedures to different purposes.
1. Modifying interpretative processes in light of the purpose to
be achieved. As, for example,
a. Reading to answer factual questions.
b. Reading an organized body of material to report.
c. Reading to determine the accuracy of the facts or events
described.
2. Adjusting rate of reading to the purpose.

As we scan this list of objectives, we see some that can
be measured quite readily and directly. Objective I C has to do with
perceiving and comprehending words. Discrimination of word forms
(I C 1) is appraised in tests for the primary grades by various types
of word-picture or word-word matching tests. Accurate perception of
form and meaning (I C 2) and association of form and meaning
(I C 3) are central to the conventional vocabulary tests. One factor
entering into paragraph comprehension measures is accurate perception
of words in context (I C 4). Fluent perception of words
contributes to rapid reading and is appraised indirectly in
tests of reading speed.

" t toben, raost of Ute coutpoueuts under 

an adequate understanding of what is read) can PP 

well-designcti test of “"’P'*“X„t1hrreaning of words in their 
questions can be asked not on y relationship of parts 

LtextCIDla),botalsoaboutthe.^^^^^^ 

of the passage (I D Ic) and explicitly stated, but 

(ID 3b). 'A.,„denteanbeea11edonnot 

also tor implied eonelusions (I D jhj accuracy 

merely to comprehend the “"1'" * „4„„s of a chain of reasoning 

f. E Atu^fo'ideSfT^he presence of propagandistie techniques 

‘'^loih the better reading ”d of1vafuati°ngS 

ceiving symbols, of getdng objectives of 

interpreting this meaning, no tea jj.ennine the pupils re- 

reading. Thus, a reading >«< “" *'“ J„, ..Wch he does in fact 
sponsiveness to ''’'me experience (1 F). Though certain 

integrate reading into his “t „,td into a lest (II A), a 

components of study skills student actually uses these 

test will be less useful in appraising how 

skills to solve an ‘""'‘“'“'Z'the'appraisal of the extent to which the 
Even more difficult fading he will select, as distinct 

individual will read and the type oi 

from the extent to which h' i„g ,o make tour main pomts 
In this discussion we ha ^ 1, „ a„y segment of the sc 

These points refer not only >“ '““'"P 

program. The points are ^ ^ 

“r A specific existing test will provide an app 

any test procedures. „f achievement "^‘1“"'” ^'°™,c„t its 

4 The evaluation of any to wha 

of ymiir ohjectwes o^^„mes that you a- seeking to 
content conforms to inose 





SURVEY READING TESTS 

One of the most widely used types of standardized test is the survey
reading test. Because of the importance of reading skills in all parts
of the school program, many schools make periodic surveys of
these skills as a basis for planning group activities and individual
remedial action. As was suggested on pp. 291-292, reading is a complex
and far-reaching enterprise. The commercial tests undertake
to appraise only certain aspects of this range of skills. In the
accompanying table a number of the better-known and more widely
distributed reading tests are listed, showing the grade levels for
which forms are available, the types of subtests that are included in
each, and certain other facts of practical interest about each.

The subtests most frequently included in survey reading tests are
word knowledge and paragraph reading. The test of paragraph reading
usually involves paragraphs of some length with questions based
upon each, though the pattern of a missing word or words to be
supplied is sometimes used (Stanford) and the technique of requiring
the reader to identify the word which spoils the meaning has also been
used (Science Research Associates). The paragraph with questions seems
to correspond most naturally to the normal reading task.

The paragraph-with-questions pattern still leaves room for a wide
range of variation in the processes that are actually tapped by the
test. Only a critical examination of the single test items will enable
the potential user to tell how many items call merely for knowing a
word meaning, how many call for selecting the particular meaning which
fits the context, how many for the answer to a specific factual
question that is answered in the passage, how many for an inference
based on information given in the passage, how many for getting the
main idea or theme of the passage, how many for sensing the author's
mood or purpose, and how many for recognizing literary devices used in
the passage. Reading with understanding is all these things and more.
The different components are represented in different tests in very
different proportions. The potential user of a reading test must
examine the actual test items to get a real understanding of what
abilities the test is measuring. And this is true not only for reading
tests. In any area of achievement, it is only as the potential user
examines critically the individual items on the test that he can judge
whether this is a valid test for his purposes.

Fortunately, these different skills are all positively correlated, so
that a survey based on some one combination of the skills will tend to
rank pupils in fairly much the same way as a survey based on a rather
different balance of them. The child who does well on a test made up
of directly factual items will in most cases tend to do well on one
involving items of inference and synthesis. But the correlation is far
from perfect. The potential user must examine each test in which
he is interested, bearing in mind the specific types of comprehension
skills that he deems important, in order to judge its suitability for
the using school.
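
The claim that differently balanced surveys rank pupils similarly follows from simple arithmetic on correlations. The sketch below is offered only as an illustration: it assumes just two standardized skills (say, factual comprehension and inference), and the weights and the skill intercorrelation are hypothetical values, not data from any published test.

```python
from math import sqrt


def composite_correlation(w1, w2, r):
    """Correlation between two weighted composites of two skills.

    w1, w2: (weight on skill A, weight on skill B) for each test.
    r: correlation between the two standardized skills.
    """
    cov = w1[0] * w2[0] + w1[1] * w2[1] + r * (w1[0] * w2[1] + w1[1] * w2[0])
    var1 = w1[0] ** 2 + w1[1] ** 2 + 2 * r * w1[0] * w1[1]
    var2 = w2[0] ** 2 + w2[1] ** 2 + 2 * r * w2[0] * w2[1]
    return cov / sqrt(var1 * var2)


# Hypothetical tests: one mostly factual items, one mostly inference items.
factual_heavy = (0.8, 0.2)
inference_heavy = (0.2, 0.8)

print(round(composite_correlation(factual_heavy, inference_heavy, 0.6), 2))
print(round(composite_correlation(factual_heavy, inference_heavy, 0.0), 2))
```

When the two skills correlate .60, the two differently balanced tests correlate about .83; if the skills were uncorrelated, the same two tests would correlate only about .47. The agreement between tests is substantial but, as the text notes, far from perfect.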

MEASURING SPEED Of READING 

An additional factor that enters in to complicate the appraisal of
reading achievement is speed. Speed is a complicating factor in
measurement in any area, but perhaps especially so in the case of
reading. To what extent do we want speed per se to enter into our
appraisal? Do we wish to penalize the person who works slowly but can
accomplish a good deal if we give him time? Generally speaking, good
practice calls for separate measurement of speed and of level of
performance. In measuring level, the objective is to provide enough
time on any test so that each individual has had an opportunity to
progress as far as his ability permits. This means that he has had time
to try all the items on the test or, if the items are graded in
difficulty, that he has had time to work along to the point where he
can no longer succeed with any of the test tasks.

In practical testing, this goal can only be approximated. Tests must
be given with a definite time limit if they are to fit into a school
program. Time limits for a power test are usually figured so that most
of the pupils can attempt most of the items. Because many users are
also interested in speed for its own sake, a number of tests have
undertaken to include separate measures of speed.
The measurement of speed presents special problems. We can
start a group of pupils reading an extended passage. At the end of a
few minutes we can stop them and tell them to mark the word they were
reading when the signal was given. But how do we know how carefully
they read the intervening material? Some may have read word for word,
others only skimmed, and others just read parts. What does "reading
the passage" really mean?

Various devices have been tried in the attempt to make the task
more uniform for different examinees. In the Iowa Silent Reading Tests
and the Traxler Silent Reading Test, the reader is instructed to read
in such a way that he will be able to answer questions. The Gates
Reading Survey for Grades 3 to 10 uses very brief paragraphs, each
ending with a multiple-choice question, and score is the number of
questions answered correctly. The Michigan Speed of Reading Test
includes in each unit one word that spoils the meaning of the unit, and
the examinee must cross out these words. Any of these devices is only
partially successful. Any device that doctors up the actual text tends
to distort the normal process of reading. Yet we cannot rely upon
instructions alone to bring about comparable care and thoroughness by
all examinees. The reading speed test is at best very dependent upon
the instructions given to the examinees and provides only an
approximate indication of the relative speeds at which different people
can read when level of care and comprehension are uniform for each
person.

SUMMARY STATEMENf 

The survey test is, then, a sampling of the tasks that represent a
particular skill or knowledge. Only certain aspects of the total field
are represented, and the balance is a little different in each test.
Most survey achievement tests have reasonably satisfactory reliability.
The main problem is to pick the one that, in its content, provides the
best balance of skills for your particular objectives.


DIAGNOSTIC TESTING 

A survey achievement test undertakes to provide a general over-all
appraisal of status in some area of knowledge or skill. A diagnostic
test undertakes to provide a detailed picture of strengths and
weaknesses in an area. Furthermore, it is anticipated that this
detailed analysis will suggest causes for deficiencies and provide a
guide for remedial work.




lary of common words bui no sLiUs f 

that he is unable to blend '^^lerctbSons. and that 

recognize the sounds that mir\g%, together with 

he makes frequent reversal errore.^ ,.„edial teaching of word anal- 
others, provide the basts for p -jiy directed toward Johnny’s 

ysis and phonic stills that f J^fufLu invls .»o steps (1) 
deficiencies. Development o ^ jjading, multiplying fiac- 

analysis ot the complex P=''“™" ^„mpo„ent subsktlls and (2) 
‘S;;- “o^rp^iir stilK rr as a, possible horn 

nostic tests." In a anyesU^^^ ay, 

all score is diagnostic. Even naaeraph comprehension, the 

one for word knowledge '^aUohnn^ showed better ability 

test makes it possible for us to say *at to ^ 

in word knowledge than ^ is, atler aB, a matter of de- 

certainly one diagnosttc flying degrees of thorough- 

cree. We may probe and jj,„ any test purporting to be 

Lss and detail. We “"“adequate are the “'! 

diagnostic: How „ overstate the value of the dtag 

that this test provides. particular test. . 

nostic information P'™*'' Uoublesome dilemma. 'J 

tesurprovlde'ru2ientdiagn«^ 

?r:weh°h: »mpre0--;>;f a'^uate re- 

-ir""':::^.,.,nr.importamin.ag^^^^ 

’^"‘'*1“'’'^^ «• V. i" U i” Pet-a. strengths 

two reasons. ,»,e i/idiwrfiwf- averaces or 

every instance mtereste ^ concerned. Gr p ^'ntext. 

and weaknesses ” j „„ particular ‘",“1,5 chance errors 

r?ar.^Tbaa“upon a^mge- Wan^^^^ aP 

in measuring a parheular pupU- 




the specific individual. This is made more acute by the fact that we
are dealing with differences between the scores on two closely
related tasks. We are interested in making such a statement as "This
pupil's ability to pick out the main idea in what he has read is poorer
than his ability to answer questions on specific factual details."
These abilities are quite closely related. How reliably can we
measure the differences between the two?

At this point, the student could well refer to the discussion of
profiles on pp. 147-152. A set of diagnostic scores is a specific
instance of a profile. All the issues about the reliability of a
profile that were raised in that discussion apply very acutely to
diagnostic tests. Since we are dealing with different aspects of a
single field, correlations between tests are likely to be fairly
substantial and the loss of reliability to be considerable when we
have to measure differences. One would think, this being so, that
authors of diagnostic tests would have been particularly concerned
about the reliability of their instruments. But, alas, this has not
generally been the case. The temperament that becomes excited about
problems of diagnosis appears to be different from the temperament
that grows concerned about issues of reliability of measurement. It
must be confessed that in many cases the reliability of diagnostic
tests is quite modest and that in many others it is unreported.
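
The loss of reliability in a difference score can be made concrete with the standard formula for the reliability of a difference between two scores of equal variability. What follows is only a minimal sketch in Python: the function name and the illustrative values are assumptions made for this example and are not taken from any particular test manual.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y between two scores.

    Assumes both scores are expressed on scales of equal variance
    (for example, standard scores). r_xx and r_yy are the reliabilities
    of the two scores; r_xy is the correlation between them.
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical subtests: each has reliability .80 and they correlate .60.
print(round(difference_reliability(0.80, 0.80, 0.60), 2))  # 0.5
```

With these assumed figures, two quite respectable subtests yield a difference score whose reliability is only about .50, and the figure falls further as the correlation between the subtests rises toward their reliabilities.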

All this means that diagnostic test results must be interpreted with
caution. The tests provide some rough and quite tentative hypotheses
as to the individual's strengths and weaknesses. But these must be
clearly recognized as tentative hypotheses and nothing more. The
test profile suggests possible causes for the present difficulty and a
jumping-off place for remedial work. If the remedial activities are
successful, well and good. If not, the remedial teacher must be
ready to review his hypotheses and to explore other leads. Diagnostic
test results are suggestions, not commands.

We find several types of diagnostic instruments in reading, and these
serve also to illustrate the varieties of diagnosis in other areas. In
the first place, we find tests with somewhat specialized subtests
yielding scores for some aspect of the total function. This type is
well illustrated by the Iowa Silent Reading Tests (Advanced Level).
These have the following subtests, each supposed to represent a
somewhat different aspect of reading skill.


Test 1. Rate and Comprehension of connected prose.

Test 2. Directed Reading of connected prose to locate answers to
factual questions.

Test 3. Poetry Comprehension, including mood, metaphor, etc.





Test 4. Word Meaning in different content areas.

[...]

How many of these are in fact both reliable enough and sufficiently
different to be usefully diagnostic is a real question. The
reliabilities of Test [...] and [...] Comprehension are reported
(probably too high, since the coefficients are based on split halves
and the tests have quite short time limits) as .75 and [...]. The
correlation between the two tests is reported as .48. [...] the
reliability of the difference [...] .53. Inferences from a datum
having this level of reliability should be made very cautiously.

The use of subtest scores such as these is more justifiable for a
class or school group. In a group average, chance errors tend to
cancel out, and the reliability of the scores becomes less important.
If the group shows some marked weakness, as in the use of indices, for
example, this may point out areas in which instruction has been
neglected and suggest directions for instruction for the group.

A second approach to the diagnosis of reading is through standard
oral reading paragraphs. One test of this type that has been used for
many years is Gray's Oral Reading Paragraphs. The test [...] simple
to [...] it is shown in Fig. 11.1.


[Fig. 11.1. Figure not reproduced.]






The record of the child's oral responses is valuable for the insight
that it gives us into the actual process of reading. The usual
written test shows us only the product of [...], the marks he makes
on a test booklet or answer sheet. If he is slow or makes mistakes,
we are often at a loss to know why. In the oral test we can see the
errors as they happen: each hesitation, each omission, each reversal.
In this way we can identify more [...] the components that are giving
the child trouble. They are lost in the final result, that the child
is slow in reading the passage or does poorly on comprehension
questions based on it.

The oral test as a basis for diagnosis can be illustrated in another
field also by the Buswell-John Diagnostic Test for Fundamental
Processes in Arithmetic. This test consists of a series of graded
examples. The examples are to be worked out by the child “thinking
out loud,” telling what he is doing and why he is doing it at each
step. The examiner has a record sheet, with a code for types of
erroneous processes. One page of the record sheet is illustrated in
Fig. 11.2. The examiner uses this form to record errors made by the
pupil as he speaks out his solution of the problem. A study of the
types of errors that the pupil is making may suggest specific points
at which the pupil needs help. The opportunity to gain insight into
the way in which the pupil is attacking the task and to understand the
nature of his errors is an advantage of oral testing procedures in
whatever field they may be used.

In a third type of diagnostic test the test maker tries to analyze a
complex task, such as reading, into its simpler components and test
these components one at a time. Thus, the Gates Reading Diagnostic
Tests include tests of recognition of words, recognition of separate
syllables, ability to blend the sounds of letter combinations, and
recognition of the single letters. The complex skill is pushed back to
smaller and smaller segments of the total task. The thought is that
when a person fails on the complex task we test to see whether he is
able to show the component skills of which the larger task is built.

This type of approach may be illustrated in another field by the
Compass Diagnostic Arithmetic Tests. In these the authors undertake
to break up each complex skill in arithmetic into its components, to
test the simplest components first, and then to add on additional
elements until the full task has been tested. Thus, the diagnostic
test concerned with division of whole numbers has subsections testing
the child upon the following contributing skills and understandings:
(1) the vocabulary of division, (2) fundamentals of short division,
(3) short division with carrying, (4) the addition, subtraction, and
multiplication used in later subtests, (5) estimating the first
quotient figure, (6) fundamentals of long division and checking, and
(7) [...] in long division. A study of scores on these subsections may
provide insight as to where the trouble really lies.

Related to this type of test is the test that is loaded with
opportunities to make a particular type of error. Thus, one of the
tests used by Gates in reading diagnosis is one in which the examinee
reads a list of words that lend themselves to reversal errors, such as
[...], no. Such a test gives a concentrated exposure and permits
assessment of the susceptibility of the examinee to that particular
type of error. Informal tests of this sort in such fields as [...],
spelling, etc., are, of course, familiar to any teacher who tries to
check upon the effectiveness of his teaching of particular usages,
rules, generalizations, and understandings.

Finally, diagnostic testing in any given field must go beyond the
immediate field of skill or knowledge and seek information on all the
background factors that contribute to success or failure in the
particular area. Thus, to understand the child with reading
difficulty we need information on his vision, his hearing, his general
intelligence level, even his interests and his emotional adjustment.
So a thorough diagnostic study will include tests of visual acuity,
muscular balance, and fusion, testing with an audiometer to be sure
the child can hear adequately, a non-reading intelligence test, and
interview or questionnaire information about factors in the child's
background and home life that may prove relevant. Diagnostic testing
spreads out beyond subject boundaries and a full diagnostic study
becomes essentially a directed case history of an individual, directed
in that it is focused on the academic problem but comprehensive in
that it covers all potentially significant features of both the skill
area and the individual's personal life.

We have described a variety of diagnostic procedures in reading and
in arithmetic. It is in these fields that the most work on diagnostic
procedures has been done. There are, in fact, few published
diagnostic tests outside these fields, though there is certainly
informal teacher diagnosis. Even in the fields of reading and
arithmetic, relatively little information about the specific
diagnostic tests is provided by the authors. Evidence on reliability
is meager, and norms are rather crude and fragmentary. Most
diagnostic tests are not very elegant psychometric devices. They have
not been sufficiently widely used to support the large investment in
development and analysis that characterizes the more popular survey
tests. Interpretation of test scores must, therefore, be made with
particular care and a good deal of tentativeness.




ACHIEVEMENT MEASURED THROUGH PUPIL PRODUCTS

[...] judges. The judges are [...] and decide which is better. The
[...] this type of scaling is that the larger the per cent of judges
noticing a difference, the larger the difference. Thus, if 90 per cent
consider specimen A to be better than B and a smaller per cent
consider B to be better than C, the difference between A and B is
greater than the difference between B and C. If 50 per cent consider C
better than D and 50 per cent consider D better than C, the two must
be considered of equal merit. Equally often noticed differences are
considered to be the same size. A difference that is agreed upon by
75 per cent of our group of judges would be considered to be the same
size wherever it occurred on our scale. Basing our scale units on this
ease-of-perception standard, we [...] specimens from very poor to very
good and assign a numerical value to each.

When we use a product scale, the procedure is to compare the specimen
of a pupil's [...] standard samples. His product is moved up and down
[...] judge decides which one it most [...]. It is then assigned
[...].

Product scales have been used [...] potentially applicable [...]
result.
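
The scaling idea described above, in which a larger per cent of agreeing judges is taken to mean a larger difference, can be sketched in a few lines of Python. The sketch rests on an assumption that is not stated in the text: that per cents are converted to distances through the normal curve, as in Thurstone-type paired-comparison scaling. The function names are hypothetical, and the unit is set so that a difference agreed upon by 75 per cent of judges equals one scale unit.

```python
from statistics import NormalDist


def scale_distance(proportion_preferring: float) -> float:
    """Distance between two specimens in normal-deviate units.

    proportion_preferring is the fraction of judges who consider the
    first specimen better than the second. The conversion through the
    normal curve is an assumption of this sketch.
    """
    if not 0.0 < proportion_preferring < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(proportion_preferring)


def distance_in_75_percent_units(proportion_preferring: float) -> float:
    """Same distance, rescaled so that 75 per cent agreement equals 1 unit."""
    return scale_distance(proportion_preferring) / scale_distance(0.75)


# Illustrative proportions only: an even split, the unit case, and 90 per cent.
for p in (0.50, 0.75, 0.90):
    print(p, round(distance_in_75_percent_units(p), 2))
```

Under this assumption an even split of the judges places two specimens at the same point on the scale, and 90 per cent agreement corresponds to a difference of somewhat less than two units.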

ACHIEVEMENT TEST BATTERIES

The tests that [...] achievement testing [...] achievement test
batteries [...] represent a package [...] for schools' use. The
typical battery is made up of from four to eight or more separate
tests covering the core knowledge and skill segments of the
curriculum. We shall examine the content of [...] in more detail
presently. The attempt of the authors is to produce an integrated
instrument that will cover the general achievement testing needs of
the typical community.

The chief virtues of the single battery of tests, as compared with a
program made up of separate tests chosen from a number of different
sources, are those of unity and of convenience. A test battery is
unified in two important respects. In the first place, it is built on
a unified and integrated plan. The parts have been chosen and the
content of each planned with an eye to the whole. Within the limits
of the professional skill and understanding of the team of authors,
the product is a unified whole in which the parts fit together to
cover the range of objectives that they deem important and feasible to
appraise with a standardized test.

A battery is unified in one other important respect. It has a unified
set of norms. The norms for all the subtests are based upon the same
population and expressed in the same form. This makes direct
comparison between the different subtests possible with a minimum of
question. We do not have to ask whether our reading test was tried
out on the same type of group as our arithmetic test, or how the
standard scores of our spelling test compare with the percentile
equivalents of our language usage measure. When tests are assembled
from different sources, these problems can be matters of real concern.
Particularly in the past, when norming populations for tests were
assembled in a somewhat haphazard manner, the comparability of a grade
score of 4.0, for example, from one test to another was subject to
serious question. The large, broadly representative groups used in
norming recent achievement batteries assure both breadth of
representation for the norms as a whole and equivalence of meaning
from one subtest to another.

Of course, the “package” testing program based on a standard battery
has certain limitations. The chief one is rigidity. Some sections
of a battery may fit a particular local curriculum better than others.
Some subtests of one battery may fit modern curricular emphases,
whereas another battery may seem better in another area. The user
of the battery gets the good with the bad, “the bitter with the
sweet.” Short of omitting certain sections completely, he must use
what the battery offers him, even though in certain respects it may
not fit his needs, as he sees them, as well as some other specific
test covering that area. How serious this is the consumer must judge
for himself when he compares the subtests of the battery that he is
using or proposes to use test by test with other tests that are
available for measurement [...] particularly in the [...] and in
practice such instruments are very widely used.

COMPARISON OF ELEMENTARY SCHOOL BATTERIES

We propose to try to get a [...] at different levels, and [...]
levels, we have chosen the [...] and sixth grades. The upper
elementary [...] is probably the level at which achievement test
[...]. We have elected to compare the following batteries, with
publishers and approximate publication dates as indicated.

California Achievement Tests, California Test Bureau, [...]
Iowa Tests of Basic Skills, Houghton Mifflin Co., 1956
Metropolitan Achievement Tests, World Book Co., 1960
Sequential Tests of Educational Progress (STEP), Educational Testing Service, [...]
SRA Achievement Series, Science Research Associates, [...]
Stanford Achievement Test, World Book Co., [...]

The tests will be [...] that are common and the features that are
distinctive.

MEASUREMENT OF WORD KNOWLEDGE

Each of the tests except the STEP provides [...] word knowledge.
However, the tests [...] this ability is kept separate for scrutiny as
a single [...] about the individual. On the one hand, the SRA [...]
vocabulary only in paragraph context, and include the [...] total
[...] reading ability. By contrast, the Iowa and Metropolitan tests
yield a separate vocabulary score, with no procedure [...] together
with paragraph reading [...] reading score. The others (California,
Stanford) provide a [...] word knowledge score [...] also provide for
combining this with paragraph reading in a total reading score.





MEASUREMENT OF READING

Every one of the tests provides for the measurement of reading
ability as represented by the reading of connected prose. The tests
vary quite widely, however, in the length of the passages, and in the
number and range of test items based on each. At one extreme, the
reading test in the Stanford Achievement Test is based on [...]
passages [...] words long with two or three items on each passage.
(These have the somewhat unusual format of omitting words or phrases
from the passage and requiring the subject to pick the word or phrase
that belongs in the gap, an activity rather different from [...]
reading.) At the other extreme, the reading test of the SRA
Achievement Series is based upon a small number of long passages
([...] 600 words) with as many as twenty test items referring to each
passage. The other tests use passages of intermediate length [...]
to ten comprehension items testing various aspects of the
comprehension of each passage.

As noted above, a number of the tests combine word knowledge
with paragraph comprehension into a single global reading score. One
may question whether it is desirable to have knowledge of words
bulk so large in an appraisal of reading. Of course, word knowledge
is quickly and easily measured, but real reading comprehension would
seem to be better exemplified by ability to get meaning and draw
inferences from connected material.


MEASUREMENT OF ARITHMETICAL SKILLS AND UNDERSTANDINGS
All achievement batteries make some appraisal of ability in
arithmetic. The older batteries tended to break the total area of
arithmetic up into computational skills and problem solving, and to
provide two subtests corresponding to these two areas. Provision was
usually made for combining the subtests into a single global appraisal
of arithmetic ability. Among many of the newer tests, however, there
is an additional concern with arithmetical concepts and
understandings. In the Iowa tests, the computational subtest has been
entirely replaced by a test dealing with arithmetical concepts. In
others (Metropolitan, SRA) a section dealing with concepts is included
in either the skills or the problem-solving subtest. Addition of the
material on concepts reflects the increased emphasis within the
arithmetic curriculum on developing meaning and understanding, as
distinct from simply facility in carrying out mechanical operations.

In arithmetic it is often difficult to free appraisal of
problem-solving ability from the influence of reading skills. This is
illustrated by the mathematics test of the STEP. In a commendable
attempt to incorporate the arithmetical tasks in real and meaningful
problems, the authors have introduced a reading level such that the
resulting score has a higher correlation with a verbal than a
quantitative score on the parallel aptitude series (SCAT).

LANGUAGE SKILLS

Another common denominator in all the batteries is appraisal of
various skills in using language. Typically they cover [...] elements
of usage such as case, number and tense, and [...]. As in other
subtests, there is a tendency here too to present [...] in multiple
choice form. Thus, most of the spelling tests call in [...] manner
for the recognition of error. The examinee must [...] is misspelled
(Iowa), or he must [...] a given word is misspelled, and if it is he
[...] (Metropolitan). Recognition of the correct form is also typical
of [...], and it is assumed that the ability to recognize error [...]
ability of the student to avoid error in his own writing.

The psychometrician is [...] assumption, on the basis of the
correlation of [...] with ratings of quality of actual writing, and is
impressed by the greater reliability, efficiency and objectivity of
the [...]. The English teacher, however, would still prefer to
appraise [...] through an actual sample of writing, an essay [...]
one of the achievement batteries that provides for [...] STEP. The
essay test is supported in part because it may measure, though
unreliably, something that is not reached by the objective [...] as
having a healthy effect on the [...] that we must evaluate writing if
we are to expect schools to continue to teach [...].

One further aspect of language [...] appears in the STEP [...] test
of listening comprehension [...].

MEASUREMENT OF STUDY SKILLS

Knowledge about [...] information has found a place in most of the
[...]. The emphasis in many schools upon individual and group
projects [...] upon [...] information from all parts of a book, not
just the words in the text, [...] supported the need for tests dealing
with [...]. [...] other, most current batteries appraise [...] and
charts; reading tables; reading maps; picking appropriate reference
sources; finding information in a reference [...].

These skills are occasionally incorporated in [...] areas such as
social studies, science and mathematics (STEP) or [...] only as a
brief subtest in some other major [...] some combination of them now
appears as a distinct [...] battery, yielding one or more distinct
scores. Thus the [...] three subtests which between them cover all of
the [...] above. This is one major respect in which the widely
distributed batteries of the present differ from those of the 30 [...]
achievement tests have adapted to, and perhaps even helped [...],
certain types of curriculum emphases.


MEASUREMENT IN CONTENT AREAS 

The six batteries that we use as illustrations split evenly on the
matter of including tests of information in content areas. Three
([...], Metropolitan, Stanford) include tests in social studies and
science; the other three do not. However, the three that do provide
these content subtests also provide a “partial” battery, a more
limited set of tests from which the content subtests have been
omitted, in recognition of the fact that some schools are not
interested in them.

Tests in content areas have tended to play a less prominent part in
standardized testing in recent years, reflecting a feeling that in the
common core of the curriculum items of information are much
less universal than skills. Thus, at one time a “Literature” test
appeared in some tests, but it now has disappeared because of the
feeling that the stories and books read will vary too much from school
to school to provide any dependable common core in terms of which
schools may be compared. The common core is undoubtedly larger
for science and social studies, but one may still question whether it
is large enough to permit meaningful comparisons between schools or
school systems. This has been handled in the STEP by making the
science and social studies tests into reading and study skills tests
in the science and social studies areas. That is, the role of
information and specific knowledge in the area is held to a minimum,
and the tests become primarily tests of ability to comprehend and work
with materials in the specific subject matter field.





BATTERIES FOR HIGH-SCHOOL ACHIEVEMENT

The batteries that have been discussed so far are for the elementary
school and junior high school. It is at these levels that batteries
have been most widely used. The more departmentalized and specialized
program of the high school and college appears to [...] specific tests
in particular subject areas. The [...] Division of the Educational
Testing Service markets a range of such specific [...].

[...] use have in common [...] sciences, social studies, and
mathematics [...] evaluation of achievement in [...] emphasize
correctness and effectiveness of [...]. The Sequential Tests of
Educational Progress (STEP), which include a [...] four through six,
as well as batteries for seven through [...] through twelve, and the
first 2 years of college, emphasize [...] skills, with objective tests
of reading, writing, [...] subjectively graded essay. The Iowa Tests
of Educational Development go beyond the fields of content [...] to
appraise abilities to locate, read, and understand [...] different
subject areas, thus attempting to test ability to [...] knowledge as
well as the amount of knowledge already [...]. [...] of this sort were
found especially useful in evaluating [...] individuals much of whose
education had occurred [...] school setting, specifically soldiers in
World War [...] amounts and types of training while in service.
Evidence is reported by the test authors that score [...] predicts
college achievement at least as well as grades during [...] years of
high school.

USING THE RESULTS OF A SURVEY BATTERY

Since the survey [...] is [...] or three most widely used types of
[...] consider ways in which the [...] testing may be used and
appraise the soundness of each. [...] done with the results from
achievement testing [...] relatively futile, perhaps positively
harmful. Let [...] some of the possibilities.

One [...] scored, [...] incorporated in some type [...] summarizing
report, and filed away. [...] is one of the forms of futility
referred to above. [...] possibility and assume that at least
something [...] test results. Let us examine various uses that might
be [...].


USE TO EVALUATE THE CURRICULUM OF SCHOOL OR SCHOOL SYSTEM
As part of a total appraisal of the effectiveness of its program, a
school system may well wish to include measures of [...] skills. An
achievement battery provides a convenient tool for doing this. The
results will show how well the particular school or school system has
progressed on the several components of the battery in relation to the
norming groups. However, in interpreting this [...], three cautions
must be borne in mind.


1. The evaluation is only partial, not complete. The battery will
give information only on the range of skills that it covers, and these
skills represent only a fraction of the objectives of the modern
school. Because they are so conveniently measurable, they may become
overvalued. This is an insidious danger. The school system must plan
to supplement standardized achievement tests with broader and more
informal appraisals of other objectives if it is to obtain a
well-rounded evaluation of its program.

2. Local emphases may differ from those that characterized the
national sample. The particular school system may have placed special
emphasis upon reading or may have delayed the introduction of formal
instruction in arithmetic. In so far as local emphasis and effort are
atypical, local accomplishment may be expected to be atypical.
Evaluation of achievement in the single school or system must take
account of distinctive local emphases.

3. Evaluation of pupil performance in a school must take account
of the characteristics of the pupil population. Schools, communities,
even regions differ in the economic and cultural level of the
population served. Associated with these differences are differences
in average level of ability as measured by our intelligence tests. The
expectation for achievement must be tempered to take these factors
into account. This may be approximated by developing regional norms or
norms for schools of a particular type.
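
The effect of interpreting one score against different norm groups, as the third caution suggests, can be illustrated with a short Python sketch. The percentile-rank definition used here (per cent scoring below, plus half of those tying) is one common convention and is an assumption of the sketch, and both sets of scores are invented for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score: float, norm_scores: list[float]) -> float:
    """Per cent of the norm group scoring below `score`,
    counting half of those whose scores tie with it."""
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical raw scores for two norm groups (illustrative numbers only).
national = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]
local = [20, 22, 25, 27, 30, 33, 36, 38, 40, 42]

print(percentile_rank(27, national))  # 65.0
print(percentile_rank(27, local))     # 35.0
```

In this invented case the same raw score stands above the middle of the broader group and below the middle of the more able local group, which is the kind of shift in interpretation that regional norms or norms for schools of a particular type are meant to make visible.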


USE TO PLAN THE PROGRAM FOR A CLASS GROUP AND THE PUPILS IN IT
Every fall each teacher in most schools faces a new group of pupils.
Within the limits set by the course of study (which may in some
instances be quite rigid limits) he must plan a program of activities
for the group as a whole and must adapt that program as best he can to
each of the children in the group. He must decide where to pick up
the various skill subjects, how much time to devote to review of
materials presumably taught in the previous year, and how fast to move
ahead. He must plan appropriate enrichment [...] materials for
independent work and free time. He will probably want to
form informal groupings within the class for work together at a
common [...].

To do these things he needs to get to know [...] as quickly,
thoroughly, and accurately as possible. Administration [...]
foundations for that picture [...] his plans to the individuals
[...]. [...] scores will provide a guide as to whether the group
[...] or retarded in each of the [...], and may indicate group areas
of relative [...] weakness. They will pick out the [...] challenging
tasks than those presented [...] materials, and those [...] should be
considered for special help either within the classroom or through
[...] with a remedial teacher if one is available.

It should be understood that this function of informing the teacher
about his pupils is not [...] with the children helps the [...] the
class group and the pupils in it. A [...] of individual pupils can
only come from working [...]. But the set of standard test scores
provides an [...] reference framework within which to see the rest of
the picture of the class [...] pupils. This function can, of course,
be served by [...] forwarded to the teacher when he [...] class in the
fall. Technically, test results from spring testing [...]
serviceable, since pupils' skills are not likely [...] the few summer
months. But it is likely [...] early in the fall will seem more
current and alive [...] the preceding spring and will be more likely
to be used by the teacher in determining his plan for the class.

USE TO IDENTIFY INDIVIDUALS FOR MORE INTENSIVE STUDY

One function of [...] identification of the [...] more intensive
study. [...] child should be studied as an individual, there are in
every [...] system some children more in [...] than others. [...]
cases in which the symptom [...] success in school skills, [...]



problem may be first identified by poor performance on a standardized
[...].

Gross irregularities in performance on [...], performance far below
his age or grade level, or [...] below [...] aptitude as indicated by
an intelligence test [...] study. But they are only cues. They are
only symptoms suggesting that something may be wrong. The [...] must
be investigated further. In the [...] educational achievement must be
related to a measure of aptitude to see that the child is falling
behind what should be expected of him. If it is reading achievement
that is at issue, his achievement should be related to performance on
an aptitude test not involving reading. [...] deficiency appears to be
a specific [...] further diagnostic procedures need to be applied to
determine the exact nature and causes of the deficiency.


UNDERSTANDING THE INDIVIDUAL PUPIL 

Though special study and remedial activities may [...] only part of
the children in a class, the school [...] the responsibility of
knowing every child as well as possible [...] provide the best
possible guidance for him in his present school activities and in
plans for the future. Level of educational achievement is one facet
of the picture that is needed in understanding any given pupil.
Appraisal of present adjustment, planning for further education, and
counseling about a life career can all be helped by information about
educational progress.


MAKING UP CLASS GROUPS AND PLACING INDIVIDUAL PUPILS
In a large school where there are enough pupils to fill several
classes in a grade or several sections in a subject, some procedure
must be adopted for assigning pupils to particular groups. Fashions
with respect to grouping together children of similar ability have
changed several times over the past 50 years. At present, this
procedure seems to be a relatively respectable one in educational
circles. When the decision has been made to try to achieve
homogeneous groups within each classroom, a standardized achievement
test or battery provides one useful tool for achieving this end.

Of course the term “homogeneous group” is rather misleading, because
the most we can do is to make a group somewhat less heterogeneous.
Whether we use over-all level on an achievement battery, reading
level, score on a scholastic aptitude test, or some combination of
these, the children in any group will still vary markedly. They will



vary in part because it will always be necessary to include children
with a range of scores in any group. They will vary in even larger
part because different abilities are not perfectly correlated. The
child who is most outstanding in reading may be fairly mediocre in
arithmetic or spelling, and vice versa. Grouping will not do away with
the need to treat pupils as individuals, or to group them [...] for
some special purposes, but it may reduce the [...] differences enough
so that the whole group can work together better and participate
effectively in common academic enterprises. [...] high-school level,
where separate grouping [...], a specific measure of achievement in
the subject [...] provide a more useful basis for grouping than a
measure of [...].

The problem of placement in a class also arises for [...] school
system. Here it may be a question [...] which to place the pupil, but
even of [...] which he can perform adequately. [...] achievement tests
can help in this decision. They make it [...] to [...] achievement
with that of the groups into [...] way that is not possible from
school marks [...].

EVALUATING THE TEACHER

It is reported that in some school [...] standardized achievement
battery is used, [...] to evaluate the success of the teacher. He is
judged by [...] performance his class shows on standardized tests
given at the end of [...] year. He is [...], with varying degrees of
unrealism, to bring [...] class up to the norm [...].

This procedure seems questionable [...] important considerations
[...] out-of-school cultural [...] evaluator is prepared [...] groups
[...] fraction of the [...] with respect to this partial criterion
neglects much of [...] may provide a very unfair evaluation of [...]
whose strengths lie in different directions. [...] premium upon easily
testable skills when evaluating the teacher [...] inevitably going to
lead the teacher to overvalue those skills in his [...]. As he is
judged, so will he judge. Skills will tend to become the [...] central
theme of his teaching, at the expense of all the other [...] the
school is trying to achieve. He will, with varying degrees of
consciousness, teach for the tests. Finally, one may [...] effect upon
teachers of a mechanical, external evaluation subject to all the
technical limitations discussed above.


SUMMARY STATEMENT 

The typical standardized achievement test is superficially similar to
an objective test made by the classroom teacher. However, it is based
on large segments of knowledge or skill common to the programs of
many schools, and it provides norms. These features mean that it is
appropriately used in making broad comparisons: between schools or
classes, between areas of achievement, or between achievement and
aptitude.

Just as an analysis of the objectives to be measured was indicated
as the first step in thoughtful construction of a classroom test, so
analysis of objectives is a prerequisite for evaluating a published test.
The test can only be evaluated in terms of the objectives that the
teacher or school is trying to achieve.

Most widely used standardized tests are survey tests, giving a general
appraisal of level of accomplishment in a broad area. If the
teacher is to work constructively with the pupil, such survey tests
need to be supplemented by more specific and diagnostic information.
Some published diagnostic tests exist, and these can be supplemented
by informal teacher appraisals. However, the reliability of the separate
scores and consequently of differential diagnoses is often low.
Diagnostic clues should be considered quite tentative.

Certain skills, such as those of handwriting, shop work, or the
arts, can be appraised effectively by comparing a pupil's product with a
scaled set of standard samples.

Standardized achievement test batteries are very popular for school
use. In these the advantage of unity in plan and standardization must
be weighed against the inflexibility of a single total battery. The pub-
lished batteries are similar in general design, though they differ in (1)
content subjects included, (2) emphasis on work-study skills, (3) bal-
ance of emphasis among different areas, and (4) specific pattern of
items in each field.

When used with discretion and proper reservations, a standardized
achievement battery can serve a useful purpose as one type of evi-
dence (1) to evaluate a school's educational program and its several
components, (2) to help the teacher plan the work of his class and
the grouping of pupils within it, and (3) to provide an understanding
of the individual pupil. Standardized test results should rarely, if ever,
be used as a basis for evaluating the effectiveness of individual teachers.



QUESTIONS FOR DISCUSSION 

1. For which of the following purposes would a standardized test be
useful? For which should a teacher expect to make his own test? Why?

a. To determine which pupils have mastered the addition and subtraction
of fractions.

b. To determine which pupils in a class are below standard in
arithmetic computation.

c. To determine the subjects in which each pupil in a class is strongest
and weakest.

d. To determine for a class which punctuation and capitalization skills
need further teaching.

e. To form subgroups in a class for the teaching of reading.

2. Examine some standardized reading test. In view of the tasks it
presents, which of the objectives outlined on pp. 291-292 does it measure
adequately? Which does it measure to some extent? Which does it fail
to measure at all?

3. Examine a standardized achievement test for a subject that you are
teaching or plan to teach.

.«. b3..=ri«^ 

same grade. How do they differ? What are the advantages of each from
your point of view?
5. What are the advantages and disadvantages of a dictation as op-
posed to a multiple-choice type of spelling test?

6. Suppose you are teaching mathematics in the first year of high
school. List the steps you would take to diagnose the achievement of
the pupils and plan for remedial instruction.

7. The manual of test W states that it can be used for diagnostic pur-
poses. What should you look for to determine whether it has any
value as a diagnostic aid?

8. Why should we be specially concerned about the reliability of the
scores resulting from a set of diagnostic tests? What implications does this
have for using and interpreting such tests?

9. Suppose that you are a college chemistry teacher and are interested
in the laboratory skills of glass blowing that your students have developed.
How might you develop a product scale for evaluating their products?

10. Before you can make a sound evaluation of the scores
made on a battery of achievement tests by a class or pupil, what informa-
tion do you need besides the converted scores themselves?

11. The town of M gives the Stanford Achievement Tests to pupils in
grades four and six and records on the cumulative record card the
grade equivalent for the whole test. What are the disadvantages of this
type of record?

12. You have given a standardized achievement battery to
your fourth-grade class. What might you, as teacher, do on the basis of
the results?

13. In city K, the Metropolitan Achievement Test is given to all schools
in April. The average grade level for each class group and for each sub-
ject is reported to the superintendent of schools' office, and these results are
mimeographed and distributed to all schools. What are the values of
this procedure, and what are the dangers in it? What changes would you
suggest?

14. In a fourth-grade group you have data from a group intelligence test
and from an achievement test battery. On what basis would you select
individuals to receive special remedial work, either in your class or from a
special teacher? What are the hazards of this procedure?

15. What should be the role of standardized test results in evaluating
the performance of the classroom teacher?



Chapter 12

Questionnaires and
Inventories for Self-Appraisal


The last three chapters have been devoted to measures of ability:
what the individual can do under test conditions and motivation to do
his best. We shall move on now to measurement of other aspects of
personality, to the appraisal of what he will do under the natural cir-
cumstances of life. Both in our discussions of personality and in our
efforts to develop instruments of appraisal, we must recognize that the
person is a unified whole. Any aspects or traits that we may separate
out are separated out for our convenience. They do not exist as sep-
arate entities. They are only aspects of or ways of looking at the uni-
tary person. However, it is inevitable that we do pick the person to
pieces to study and understand him. We cannot look at everything
at once.

In Chapter 2 we identified five segments of personality, to wit:

Temperament refers to the individual's characteristic mood, activ-
ity level, excitability, and focus of concern. It includes such dimen-
sions as cheerful-gloomy, energetic-lethargic, excited-calm, introverted-
extroverted, and dominant-submissive.

Character relates to those traits to which definite social value is
attached. They are the "Boy Scout" traits of honesty, kindliness, co-
operation, industry, and such.

Adjustment is a term that we shall use to indicate how well the
individual has been able to make peace with himself and the world
about him. In so far as the individual can comfortably accept himself
and his world, in so far as his ways of life do not get him into trouble
in his social group, he will be considered well adjusted.

Interests refer to tendencies to seek out and participate in certain
activities.

Attitudes relate to tendencies to accept or reject particular groups
of individuals, sets of ideas, or social institutions.



SELF-REPORT INVENTORIES 


METHODS OF STUDYING PERSONALITY

Most of the evaluation techniques we shall consider in this and the
following chapters have to do with one or more of the aspects of per-
sonality identified above. To what sources may we go for evidence on
these aspects when we wish to study an individual? First, we can find out
what the individual has to say about himself. Second, we can find out
what others say about him. Third, we can see what he actually does,
how he behaves in the real world of things or people. Fourth, we can
observe how he reacts to the world of fantasy and make-believe.

WHAT THE INDIVIDUAL SAYS ABOUT HIMSELF

One obvious source for information about a person is the person
himself. No one else has as intimate a knowledge of Johnny
as Johnny has of himself. He is aware of hopes, wor-
ries, and concerns that may be well hidden from others. To get
at the individual's view of himself we may interview him, probing
areas that seem sensitive or significant. Another approach is to in-
corporate the questions that might be asked in a face-to-face interview
into a uniform questionnaire or personality inventory. The responses
the individual makes in responding to the set of questions are combined in
various ways to provide a picture of him as he sees himself.

These procedures will be elaborated in this chapter, and their strengths
and weaknesses pointed out.


APPRAISAL THROUGH THE OPINION OF OTHERS

For some purposes, we may be interested in how a person is per-
ceived by his fellow beings. Is he seen as a friendly fellow worker?
A fair teacher? An industrious pupil? A convincing salesman? A
generally desirable employee? The opinion of others may be the sig-
nificant fact in certain settings. It is also a very convenient way of
getting a summary appraisal of a fellow man. For these reasons, rating
procedures have been widely used. We shall consider their values and
limitations in the next chapter.


MEASURES OF BEHAVIOR

It can be argued that for practical purposes an individual's person-
ality is what he does, rather than what he says or what is said about
him. The problem is to develop procedures for appraising
behavior, not distorted for the purpose of making a good impression.
Some attempts have been made to do this with objective tests, and we
shall consider these briefly in Chapter 14. Of more importance and
current interest are procedures for observing the individual and for
recording or evaluating his responses as they are seen by an observer.

THE WORLD OF IMAGINATION AND FANTASY

What an individual will tell about himself in response to questions
is limited by his willingness to reveal himself, his understanding of him-
self, and his understanding of the language in which the questions are
presented. For this reason, indirect methods have been sought to
avoid these limitations and permit him to "open up" more fully. One
indirect avenue is that of fantasy, imagination, and make-believe. We
may study what the person sees in ink blots, what stories he tells about
an ambiguous picture, what play scenes he acts out with dolls, what
he does with paints and modeling clay. These materials and others
have been used to elicit imaginative productions that psychologists
have studied as a source of understanding of children and adults. The
individual is allowed to express himself through play materials or to
project his own interpretations into ambiguous stimuli, and thus to
reveal himself to us. These are expressive and projective techniques
for personality appraisal. We shall undertake to describe and evaluate
them in Chapter 15.


INTERVIEW

If we wish to find out about a person, one obvious way to do so is
to ask him questions and evaluate his answers. If the questions are
asked orally in a face-to-face situation, we are carrying out an inter-
view. The interview has been a perennial favorite as a way of study-
ing people. It is widely used by colleges and professional schools, by
employers, and by clinicians working with disturbed individuals. Why
is the interview looked upon with such widespread favor?

The popularity of the interview is not based primarily upon its dem-
onstrated validity as a device for appraising people. In fact, evidence
for the validity of the impressions or conclusions derived from inter-
views is spotty and rather contradictory. Interview procedures are
basically subjective, variable, and heavily dependent upon the skill of
the interviewer. It has repeatedly been demonstrated that different
interviewers interviewing the same person come up with quite varied
impressions of him. The variability arises in part from variation in
the questions asked and the lines of inquiry intensively pursued. It
arises in part from differences in interpretation and evaluation of the
responses the individual makes. The typical interview is not a precise
or efficient psychometric technique.

The appeal of the interview lies rather in its great flexibility and
adaptability. The interviewer can structure the interview in whatever
way seems to him most suitable, in the light of the purpose of the
interviewing and of the responses elicited to previous questions. He can
skim over certain areas; probe intensively in others. He can give full
play to his "clinical insight" and "intuition."

There is no doubt that the flexibility possible in the interview situa-
tion has certain elements of strength. It permits the interviewer
to take full advantage of everything he has learned about the inter-
viewee as he directs the further course of the interaction. But this
same flexibility contains elements of weakness. It tends to reduce
comparability from one interviewer to another and from one inter-
viewee to the next. It makes it possible for an interviewer to ride a
personal hobby and ignore many obvious areas of inquiry. If it
permits full scope to the wisdom of the wise, so also it allows
rope to the foolishness of the foolish or the biases of the biased.

One approach that undertakes to reduce the subjectivity and varia-
bility of interview procedures, while still maintaining the flexibility and
vividness of direct personal contact, is the structured interview.
Structured interview procedures give the interviewer a fairly detailed
guide of topics to be covered and areas of inquiry to be included.
These may include, for an employment interview, family
interrelationships, school interests and activities, sports,
previous work history, reasons for leaving previous jobs, and other
similar areas. Within any one of these rather broad areas there may
be several more specific questions to which the interviewer is to get
an answer. The interviewer retains freedom and flexibility with re-
spect to the order in which he attacks the different topics and the extent
to which he pursues each. At the same time, he has a guide to make
sure that a standard set of areas of inquiry is covered in each inter-
view. The structured interview is a compromise between the free in-
terview on the one hand and the printed biographical data blank or
personality inventory on the other.

The questions during an interview session elicit responses that are
descriptive of the individual. However, these responses require some
degree of interpretation if they are to provide a useful picture of him.
The interpretation may sometimes flow quite directly from the mani-
fest content of the responses. This would be the case when the inter-
viewer interprets a report of membership in many school clubs,
the holding of many school offices, out-of-school experience in selling,
and membership in the debating team as evidence of assertiveness and
social leadership. Sometimes the interpretation may be quite indirect,
and dependent upon the latent or concealed, rather than the manifest
or obvious content of the response. This is true of many of the psy-
choanalytic interpretations of the communications from patient to
analyst. In these variable and unstandardized interpretations lie po-
tential strengths and frequent weaknesses of the interview as an ap-
praisal technique.

The clinical interview is an unstandardized inquiry, highly depend-
ent upon the particular interviewer both for the way it is carried out
and for the way it is interpreted. Furthermore, individual interviews
place very heavy demands upon the time of interviewing personnel,
demands which may be prohibitive in a number of situations. To
economize on interviewer time, then, and to provide an inquiry that
is uniform in presentation and procedure for evaluation, the printed
questionnaire has been developed. The self-report questionnaire or
inventory is essentially this: a standard set of questions about some
aspect or aspects of the individual's life history, feelings, preferences,
or actions, presented in a standard way and scored with a standard
scoring key.

THE BIOGRAPHICAL DATA BLANK

An obvious and important use of the questionnaire is as a means
of eliciting factual information about the individual's past history.
Place and date of birth, amount and type of education and degree of
success with it, nature and duration of previous jobs, hobbies, special
skills, and a host of other biographical facts can be determined most
economically through a blank filled out by the individual himself. It
is the economy and efficiency of this approach that makes it particu-
larly appealing. Though his reports may be inaccurate in some re-
spects, the individual himself is probably the richest single repository
for the factual information we would like to have about him.

The problems in using questionnaires to elicit facts are primarily 
problems of communication. When questions are preformulated and 
appear in printed form and answers are written down, misunderstanding
may occur either in the respondent's interpretation of the question
or in the using agency’s interpretation of his response. If there is no 
personal interaction, these misunderstandings cannot be cleared up 
with an oral question or a further probing into the area of uncertainty. 

It is important, therefore, that a fact-finding questionnaire be very 
carefully worded and that it be tried out in preliminary form with 
small groups to make sure that the ambiguities have been cleared out 
of it. 



An interview to supplement the questionnaire is often desirable, in
order to permit clarification of any of the
items that are puzzling to the user or to get further information on some
points. As a matter of fact, one appropriate use of self-report inven-
tories of all types is to provide a starting point for an interview,
the questionnaire providing leads that may be followed up in the
interview.

Sometimes the factual information on an application blank or
fact-finding questionnaire has been used only to determine whether the
individual meets certain stated requirements to be eligible for a job,
educational program, or the like. Sometimes it serves as part
of the raw material from which the personnel director, director of ad-
missions, or scholarship committee makes a clinical judgment of the
individual's desirability as an employee or student. In a number of instances,
however, biographical data blanks have been analyzed statistically
to determine to what extent particular responses to items actually
predict some criterion of job success. Items found to differentiate
more successful from less successful individuals are given scoring
credit, and the separate items are summed to give a score for the blank
as a whole. Thus, the World War II programs for selecting pilot
trainees for both the Army and the Navy used a biographical
data blank that was treated just as if it were a test. The life-insurance
companies have for a number of years used an Aptitude Index for
selecting insurance salesmen, one section of which consists of
items about the individual applicant. Thus, the blank asks
about the amount of insurance he himself carries, his net worth, etc.
A scoring system assigns scores for each response in terms of the suc-
cess experienced by those in the validation group who had given that
response.

In the examples given above, objective scoring of a biographical
data blank provided one of the most valid predictors of job success.
These results suggest that there may be a number of other selection
situations in which a standard scoring procedure could be used to
advantage. The development of scoring weights is a major undertak-
ing, but once a scoring system has been developed the scoring of indi-
vidual blanks proceeds rapidly. It has even been possible in large-scale
use of biographical inventories to prepare them in multiple-choice
form and score them like any standard test.
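The empirical keying described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the weighting rule (success rate among those giving a response, minus the success rate of the whole validation group), the item name `carries_insurance`, and the function names are all hypothetical and are not taken from any published blank or scoring manual.

```python
from collections import defaultdict


def build_scoring_key(validation_blanks, succeeded):
    """Derive a weight for every (item, response) pair from a validation group.

    Assumed weighting rule (illustrative only): the success rate among those
    who gave the response, minus the success rate of the whole group.
    """
    overall_rate = sum(succeeded) / len(succeeded)
    # (item, response) -> [number giving the response, number of those who succeeded]
    tallies = defaultdict(lambda: [0, 0])
    for blank, was_successful in zip(validation_blanks, succeeded):
        for item, response in blank.items():
            tallies[(item, response)][0] += 1
            tallies[(item, response)][1] += int(was_successful)
    return {
        pair: round(n_success / n_given - overall_rate, 2)
        for pair, (n_given, n_success) in tallies.items()
    }


def score_blank(blank, key):
    """Sum the credits for one applicant; responses not in the key earn nothing."""
    return round(sum(key.get(pair, 0.0) for pair in blank.items()), 2)


# Hypothetical validation group of four men and their later job success.
validation_blanks = [
    {"carries_insurance": "yes"},
    {"carries_insurance": "yes"},
    {"carries_insurance": "no"},
    {"carries_insurance": "no"},
]
succeeded = [True, True, False, True]

key = build_scoring_key(validation_blanks, succeeded)
print(key)
print(score_blank({"carries_insurance": "no"}, key))
```

Once the key has been built from the validation group, scoring a new blank is a simple table lookup and sum, which is why the scoring of individual blanks proceeds rapidly.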


INTEREST INVENTORIES 


One aspect of the individual's make-up that we would like to study,
both to understand him as a person and to help in such immediate
practical problems as educational and vocational guidance, is the do-
main of interests and aversions, preferences for activities and surround-
ings. Of course, in the matter of vocational interests, the simplest
procedure would seem to be to ask the individual how much he would
like to be an engineer, for example. However, this doesn't work out
very well in practice. In the first place, people differ in the readiness
with which they exhibit enthusiasm. "Like very much" for person A
may signify no more enthusiasm than "like" for person B. In the
second place, people differ substantially in the nature and complete-
ness of their understanding of what a particular job means in terms
of activities and conditions of work. "Engineer" to one person may
signify primarily out-of-doors work; to another it may carry a flavor
of the laboratory or drafting board; to still another it may signify
vaguely a high-prestige, science-oriented job. These varied and in-
complete meanings cause a response to the single question, "How much
would you like to be an engineer?" to be a rather unsatisfactory indi-
cator of the degree to which the individual has interests really suitable
for the profession of engineering. It is for these reasons that psycho-
metricians have undertaken to broaden the base of information and to
ask a whole array of questions about the individual's likes and dis-
likes, rather than simply to ask directly about preference for particular
jobs.

THE STRONG VOCATIONAL INTEREST BLANK

One of the best known instruments for appraising interests is the
Strong Vocational Interest Blank for Men. This inventory is made
up of 400 items, broken up into the following types: liking for occupa-
tions, liking for amusements, liking for activities, reaction to peculiari-
ties of people, choice or preference between activities, and evaluation
of personal abilities and characteristics. To most of the 400 items
in the Strong Blank the individual responds by marking one of the
three given options L, I, and D (Like, Indifferent, Dislike). A re-
sponse is called for to each item. Over 40 different scoring keys have
been developed for the men's blank. Most of these are for specific
occupations, largely at the professional level, such as architect, chemist,
lawyer, or YMCA secretary, though there are also keys for interest
maturity, masculinity of interests, and occupational level.

The scoring key for each occupation was developed by comparison
of a group of men who were successfully engaged in that occupation
with a reference group of men-in-general. Thus, the per cent of men
in occupation A choosing the L, I, and D options to item 1 is com-
pared with the per cent of men-in-general choosing these same options.
If enough more men in occupation A choose a particular option, that
option receives a plus score for that occupation. If the per cent is
smaller for occupation A, the option receives a minus score. If the
per cent for occupation A is very much larger than that for men-
in-general, the score may be large; smaller scores
are assigned to smaller differences. Thus, the items are weighted to
take account of the sharpness with which the item differentiates.

Table 12.1 shows the scoring key for the first several items
for four different occupational keys. Note the different weights for
the different items. Note that some or all of the options to a given
item may receive a zero weight.
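A rough sketch may make the logic of such a key concrete. The rule used here for turning a percentage difference into a weight (one point per ten percentage points, truncated toward zero and capped) is purely an assumption for illustration; the percentages, the `step` and `cap` parameters, and the function names are hypothetical and do not reproduce Strong's actual weighting procedure.

```python
def option_weights(pct_occupation, pct_general, step=10.0, cap=4):
    """Weight each option by how sharply it separates the occupation from men-in-general.

    Assumed rule: one point per `step` percentage points of difference,
    truncated toward zero, and limited to the range -cap..+cap.
    """
    weights = {}
    for option in pct_general:
        difference = pct_occupation[option] - pct_general[option]
        weight = int(difference / step)  # small differences truncate to zero
        weights[option] = max(-cap, min(cap, weight))
    return weights


def occupational_score(responses, key):
    """Sum the plus and minus credits for the options the examinee marked."""
    return sum(key[item][option] for item, option in responses.items())


# Hypothetical percentages choosing Like / Indifferent / Dislike on one item.
item_1 = option_weights(
    pct_occupation={"L": 62, "I": 30, "D": 8},
    pct_general={"L": 25, "I": 35, "D": 40},
)
print(item_1)

key_for_occupation_a = {"item_1": item_1}
print(occupational_score({"item_1": "L"}, key_for_occupation_a))
```

Because each occupation has its own table of weights, the same set of responses must be run against every key separately to produce the full series of occupational scores.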

Table 12.1. Scoring Weights for Sample Items and Keys of Strong
Vocational Interest Blank for Men



An individual's score is obtained by summing up the plus and
minus credits corresponding to the responses he has given. Since
the weights are different for each occupation, a separate key is
required, and a separate score is obtained for each occupational
scale. Thus, a series of scores is obtained showing how closely the
responses given by our examinee correspond to those typically given
by each specific occupational group. Raw scores are transformed into
a standard score scale in which 50 represents the mean for men in the
specific occupation. A scale of letter grades is also provided, in which
A represents close resemblance to the particular occupational group,
B+, B, and B- lesser degrees of resemblance, and C+ or C
patterns quite different from those of the particular occupation.
Table 12.2 shows the standard scores and letter ratings on the occu-
pational scales of the blank for one college freshman. This young
man shows interest patterns resembling closely (A) those of chemists,



Table 12.2. Scores on Strong Vocational Interest Blank for a
College Freshman

                                        Standard    Letter
        Occupation                       Score      Rating

   I.   Artist                            26         C+
        Psychologist                      22         C
        Architect                         29         C+
        Physician                         42         B+
        Dentist                           41         B+
  II.   Mathematician                     26         C+
        Engineer                          44         B+
        Chemist                           52         A
 III.   Production manager                39         B
  IV.   Farmer                            59         A
        Carpenter                         44         B+
        Math and science teacher          48         A
   V.   YMCA physical director            34         B-
        Personnel manager                 21         C
        YMCA secretary                    Low *      C-
        Social science teacher            17         C
        City school superintendent        Low *      C-
        Minister                          Low *      C-
  VI.   Musician                          25         C+
 VII.   CPA                               16         C
VIII.   Accountant                        25         C+
        Office worker
        Purchasing agent
        Banker                            22
  IX.   Sales manager                     19         C
        Real estate salesman
        Life-insurance salesman           Low *
   X.   Advertising man                   19         C
        Lawyer
        Author-journalist

* "Low" designates a standard score of 15 or lower.



farmers, and mathematics and science teachers, and
quite like (B+) those of physicians, dentists, engineers, and carpen-
ters. His interests are very unlike those of YMCA secretaries,
city school superintendents, ministers, and life-insurance salesmen.
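The conversion from raw score to standard score and letter rating can be sketched as follows. Two things here are assumptions rather than statements from the text: the standard deviation of 10 on the standard score scale (the text specifies only that 50 is the mean for men in the occupation), and the letter cutoffs, which were chosen simply because they agree with the entries in Table 12.2. The function names and the sample figures are hypothetical.

```python
def standard_score(raw, occupation_mean, occupation_sd):
    """Place a raw score on a scale with mean 50 for men in the occupation.

    The scale standard deviation of 10 is an assumption of this sketch.
    """
    return round(50 + 10 * (raw - occupation_mean) / occupation_sd)


def letter_rating(standard):
    """Translate a standard score into a letter rating.

    Cutoffs are assumed values that happen to agree with Table 12.2;
    they are not quoted from the test manual.
    """
    for cutoff, letter in ((45, "A"), (40, "B+"), (35, "B"), (30, "B-"), (25, "C+")):
        if standard >= cutoff:
            return letter
    return "C"


# Hypothetical raw score of 120 on a key whose criterion group has mean 100, SD 40.
score = standard_score(120, occupation_mean=100, occupation_sd=40)
print(score, letter_rating(score))

# Standard scores taken from Table 12.2.
for value in (52, 44, 39, 34, 29, 22):
    print(value, letter_rating(value))
```

Expressing every occupational scale in the same standard-score units is what allows the ratings for different occupations to be laid side by side in a single profile.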

Strong has developed a companion Vocational Interest Blank for
Women that follows closely the pattern of the men's blank. How-
ever, the blank has been rather less thoroughly studied than the
men's blank, and seems to have been less useful. This may
be due to the fact that specifically vocational interests are less central
in the lives of many women, being dominated by general "home-
maker" interests, so that interest profiles in women tend to be less
clear-cut and meaningful.

Originally, scoring the Strong Vocational Interest Blank was a very
time-consuming task because of the large number of separate scores
that are called for. Hand-scoring required several
hours' work. But twentieth century automation has come to the test-scoring
field, and a special device developed by E. J. Hankes makes it pos-
sible to score the blanks at very high speed. This machine is
available only at Engineers Northwest, Minneapolis, Minnesota, and
special answer sheets must be sent to them. However, the blanks can
be scored at a cost that is a fraction of what it would be by
hand methods.*

There are two points about the construction of the Strong Blank to
which we wish to call especial attention at this time. In the first place,
the person taking the test responds by choosing one of a set of re-
sponse categories for each item (L, I, D). An
individual could choose all L's, and a particularly jaundiced one could
choose all D's. There is a certain amount of freedom to impose one's
own standards upon the task. Secondly, the keys are empirically deter-
mined. That is, they are defined by the responses of a particular job
group and not by any internal logic. We wish now to contrast with
the Strong Blank the Kuder Preference Record, which is different with
respect to both of these features.


* A price of 70 cents per answer sheet was quoted in 1960 for scoring blanks in
quantity lots.

THE KUDER PREFERENCE RECORD (VOCATIONAL)

The Kuder Preference Record (Vocational) is made up of triads,
or sets of three options. Typical sets might read:

Go for a long hike in the woods.
Go to a symphony concert.
Go to an exhibit of new inventions.

Fix a broken clock.
Keep a set of accounts.
Paint a picture.

In each set the individual is required to mark the one he would like to
do most and the one he would like to do least.
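The forced-choice response format can be sketched in a few lines. Everything specific in this sketch is an assumption made for illustration: the assignment of each sample option to an interest area, the point values (2 for the option liked most, 1 for the unmarked option, 0 for the option liked least), and the function name. The published scoring keys are not reproduced here.

```python
def score_triads(triads, answers):
    """Accumulate interest-area scores from 'most' and 'least' choices.

    triads  : list of 3-tuples naming the (assumed) interest area of each option
    answers : list of (most_index, least_index) pairs, one per triad
    Assumed points: most liked = 2, unmarked = 1, least liked = 0.
    """
    totals = {}
    for triad, (most, least) in zip(triads, answers):
        if most == least or not {most, least} <= set(range(3)):
            raise ValueError("mark one option as most and a different one as least")
        for position, area in enumerate(triad):
            points = 2 if position == most else 0 if position == least else 1
            totals[area] = totals.get(area, 0) + points
    return totals


# Hypothetical keying of the two sample triads given in the text.
triads = [
    ("outdoor", "musical", "scientific"),        # hike / concert / exhibit of inventions
    ("mechanical", "computational", "artistic"), # fix clock / keep accounts / paint picture
]
answers = [(0, 1), (0, 1)]  # first option liked most, second liked least, in each triad

scores = score_triads(triads, answers)
print(scores)
print(sum(scores.values()))
```

Under this assumed rule every examinee distributes exactly three points per triad, so the total across all areas is fixed. A high score in one area can be obtained only at the expense of others, which is the essential property of the forced-choice pattern.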

Scoring keys were established on the basis of the internal relation- 
ships of the items. Thus, a study of the responses to the items showed 
that a number of items dealing with mechanical activities tended to 
hang together. If a person chose one he was likely to choose others, 
and if he rejected one he was likely to reject the others. Moreover, 
items in this group showed relatively little relationship to the remain- 
ing items. The items grouped together in a distinct cluster. From
the nature of the items it was evident that this cluster related to me- 
chanical interest. Those items having a substantial correlation with 
this cluster were included in a scoring key that gave a score for me- 
chanical interest. 

In the same way, other clusters were identified and built up in which 
the items went together but were largely independent of items not in 
the cluster. Scoring keys were developed for these. The Preference 
Record now yields scores for the following interest clusters: outdoor, 
mechanical, computational, scientific, persuasive, artistic, literary, 
musical, social service, and clerical. Raw scores are converted into 
percentiles, separate norms being supplied for male and female high- 
school students and for male and female adults. 
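The conversion of raw scores into percentiles, with separate norms for each group, might be sketched as below. The norm-group scores are invented for the example, and the mid-point convention for ties (counting half of the people who obtain exactly the same score) is an assumption; the text does not say which convention the published norm tables follow.

```python
def percentile_rank(raw, norm_scores):
    """Per cent of the norm group scoring below `raw`, counting half of any ties."""
    below = sum(score < raw for score in norm_scores)
    tied = sum(score == raw for score in norm_scores)
    return round(100 * (below + 0.5 * tied) / len(norm_scores))


# Hypothetical norm groups for one interest area; real norms rest on far larger samples.
norms = {
    "male_high_school": [10, 20, 30, 40, 50],
    "female_high_school": [20, 30, 40, 50, 60],
}

print(percentile_rank(30, norms["male_high_school"]))
print(percentile_rank(30, norms["female_high_school"]))
```

The same raw score yields a different percentile in each norm group, which is the reason separate norms are supplied for male and female students and adults.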

Table 12.3. Kuder Preference Record Scores of a College Freshman *

                                      Percentile
   Interest Area       Raw Score      Equivalent

   Outdoor                71              95
   Mechanical             58              87
   Computational          17              16
   Scientific             60              93
   Persuasive             75              07
   Artistic               30              68
   Literary               23              78
   Musical                12              45
   Social Service         36              46
   Clerical               19              01

* Scores for same individual shown in Table 12.2.

In Table 12.3, the Kuder scores are given for the same college fresh-
man whose Strong scores were shown in Table 12.2. On the Kuder,

Lig™ "n " ~ •> "■"“ - "““" “■' 

sistent and support one another. 


COMPARISON OF STRONG AND KUDER INVENTORIES

Note that in the Kuder Preference Record, the examinee must
pick a most liked and a least liked activity. No matter
how much or how little he likes all three, one must be preferred and
one rejected. This forced-choice pattern appears in a number of in-
ventories and should be contrasted with the free-choice pattern
found in the Strong. The forced-choice pattern imposes a uniform
frame of reference upon everyone. Differences in general optimism
are controlled. Everyone must express the same balance of prefer-
ences and rejections. Thus, superficial differences in standards of
judgment, or what has been called "response set," are eliminated. But
so also are genuine differences in interest level. Whether the forced-
choice pattern produces a net gain in this respect is still a matter of
debate.

Note again that in the Preference Record the scores relate
to coherent interest clusters rather than to something outside the indi-
vidual or the test. The scores carry their own relatively direct mean-
ing in terms of the common theme running through the cluster.
The meaning does not have to be inferred by thinking what engineers
or salesmen are like. If our purpose is to build up a meaningful de-
scription of an individual, the internally consistent scales appear more
satisfactory than those that are externally oriented. To say that a
person is high on mechanical, scientific, and out-of-doors interests and
low on clerical and persuasive is more directly interpretable than that
he is high on interests characteristic of farmers, chemists, and mathe-
matics-science teachers and low on those characterizing ministers and
YMCA secretaries. Internally coherent clusters definable in terms of
their common theme "make sense" better than job-oriented keys.
VtTien it comes to rating the individual for a specific job, 
the balance of advantages is radically changed. If our concern i 
help the indmdual decide whether he would be content in the jo 
engineer, it is much more directly relevant to know how well 
tcrcsts correspond to th(»e of successful engineers than to know 

hieh his mechanical and scientific interests are. In the first case, 

■ the sec- 


scoring key itself defines what the interests of engineers arc; m 
ond case we must cither infer this or determine it from a separate 


study- 




Either the internally consistent or job-oriented approach to
inventorying interests is possible; which will work better depends on
our particular purpose. If our purpose is to appraise appropriateness
of interests for a limited number of specific jobs, this may be done
effectively with a specific job key for each job. If, however, our
concern is to get a meaningful description of a person and perhaps to
be prepared to use that description to make inferences as to his
suitability in any one of a very large number of jobs, then the
homogeneous cluster scores seem preferable.
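
The idea of a specific job key can be made concrete with a small Python sketch. Everything in it is hypothetical: the items, the endorsement rates, the 15-point threshold, and the unit weights are chosen for illustration and do not reproduce the weights of any published key. The sketch assumes only the general principle of comparing an occupational group with a general reference group, item by item.

```python
def build_key(occupation_rates, general_rates, threshold=0.15):
    """Build an empirical key from item endorsement rates.

    An item gets weight +1 if the occupational group likes it clearly
    more often than the general group, -1 if clearly less often, and is
    left out of the key otherwise. The threshold is an assumption.
    """
    key = {}
    for item, occ_rate in occupation_rates.items():
        diff = occ_rate - general_rates[item]
        if diff >= threshold:
            key[item] = 1
        elif diff <= -threshold:
            key[item] = -1
    return key

def score(key, liked_items):
    """Sum the key weights of the items the examinee marked as liked."""
    return sum(weight for item, weight in key.items() if item in liked_items)

# Hypothetical proportions of each group marking "like".
engineers = {"repair a clock": 0.80, "sell insurance": 0.20,
             "read poetry": 0.35, "solve equations": 0.85}
men_in_general = {"repair a clock": 0.50, "sell insurance": 0.40,
                  "read poetry": 0.30, "solve equations": 0.45}

engineer_key = build_key(engineers, men_in_general)
henry_score = score(engineer_key, {"repair a clock", "read poetry"})
print(engineer_key, henry_score)
```

Note that "read poetry" drops out of the key not because of its content but because it fails to separate the two groups; the key is defined entirely by the observed differences.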


RELIABILITY, VALIDITY, AND PERMANENCE OF INVENTORIED INTERESTS
The Strong Vocational Interest Blank is one of the most thoroughly
investigated psychometric tools we have, and, though the history of the
Kuder Preference Record is shorter, it too has been intensively studied.
Both instruments yield scores that are reasonably reliable for
individuals in their teens or over. Thus, for 285 Stanford University
seniors Strong reports odd-even reliabilities for the separate
occupational scales ranging from .73 to .94, with an average value of
.88. A number of reliability studies with the Kuder, based on analysis
of a single testing, give values averaging about .90. The reliability
of the scores extracted from these interest inventories compares
favorably with that of scores on ability tests.
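
An odd-even reliability of the kind just mentioned can be sketched as follows. The response data are invented, and the step-up by the Spearman-Brown formula is an assumption: it is the customary companion of the odd-even method, but the passage above does not say what correction, if any, was applied to the figures it reports.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then apply the
    Spearman-Brown formula to estimate full-length reliability."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_half, even_half)
    return 2 * r_half / (1 + r_half)

# Hypothetical item scores (1 = keyed response) for five persons on six items.
data = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 1],
]
estimate = odd_even_reliability(data)
print(round(estimate, 2))
```

With so few items and persons the estimate is naturally unstable; a real scale of several dozen items would be needed to approach the values quoted in the text.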

For the [...] there is evidence that interests show a good deal of
stability over time, at least in adolescents and adults. Data on the
average correlation at different ages and over different periods may
be summarized as follows:


1 or 2 years 
3 to 5 years 
6 to 10 years 


Upper 

Elementary

High 

College 

College 

School 

School 

Freshmen 

Seniors 

.55 

.65 

.80 


.30 


.75 

.75 


.50 

.55 

.70 


The stability is low in the elementary school, but for persons of
college age stability compares favorably with that for intelligence
tests.

In appraising the validity of an interest inventory as a description 
of how the individual feels about activities and events in the world 
about him, the main issue is the truthfulness of his responses. There 
isn’t really any higher court of appeal for determining a person's likes 
and preferences than the individual's own statement. 

A number of studies have indicated that inventories such as the
Strong can be faked. If a group of examinees is told to try to respond
the way that life-insurance salesmen [...], they are rather successful
in [...]. However, this is no indication that the blank will be faked,
even when used as an employment device.

When the inventory is used for [...] the respondent, as is most often
the case, there is probably little reason to anticipate intentional
faking. The individual may be expected to report his likes and dislikes
as he knows them. But his self-knowledge may be imperfect, so his
reports may be inaccurate in some respects. Thus, he may say that he
would like to go to concerts because he feels that that is the thing
to say, but his actions may belie his statement; he may in fact avoid
concerts [...]. This lack of self-insight is a real problem. But it is
probably mitigated somewhat, in the inventory approach to interests,
where [...] points of poor insight will have only minor effects upon
[...].

The validity of interest inventories as predictors of [...] behavior
is another matter. Scoring keys for the Strong were established by
comparing men who were already in the occupation with men in general.
Kuder occupational interest profiles have also been prepared by
determining the average level in each of the ten interest areas for
individuals already working in the occupation. But the common interest
patterns of individuals in a field of work may have grown out of the
work. The men may have come to exhibit common patterns from the very
nature of their work experience. The [...] evidence on predictive
validity would come from testing a group before they entered the world
of work and determining whether those who entered and continued in a
particular occupation had distinctive interest patterns before they
entered the occupation. This is an expensive operation, expensive in
the time that must elapse before men can become settled in their
occupation and expensive in the [...] of cases among literally
hundreds of occupations.

Strong has been able to follow some groups who were tested as
college undergraduates and does have some evidence on the extent to
which students with interests characteristic of a particular occupation
tended to enter that occupation and to persist in it. For the [...]
individual, the occupation in which he was actually working [...]
later ranked second or third for him among all the scales of the
Strong. Considering group averages, those who remained in an occupation
received higher interest scores for that occupation than for any other
occupation and higher than those who switched to some other occupation.
McCully followed up a group of men who had been given the Kuder as a
part of Veterans Administration counseling at the end of World War II.
They were located several years later, and their occupation determined.
Table 12.4 shows the average standard scores on each of the ten Kuder
interest areas for those occupational groups that were large enough to
justify study. The results show clear-cut and fairly substantial [...]
predictors of occupational choice.

Table 12.4. Mean Kuder Standard Scores of Different Occupational
Groups *

[Table body illegible in the source.]

* Based on a mean of 0 [...] employed veterans.

INTEREST AND ABILITY

It is important not to confuse interest and ability. The fact that a
boy scores high on the Scientific interest scale of the Kuder or on
the physicist scale of the Strong does not mean that he possesses the
intellectual and other abilities needed to [...] the concepts of
physics and become a physicist. Interest measures tell us nothing
directly about abilities, though, as we shall see in a moment, certain
relationships do exist between abilities and interests. Interest
measures and ability measures deal with distinct aspects of [...] a
field of study or work. Each provides information that supplements the
other. Interest is not a substitute for ability, and ability to learn
the skills of a job is no guarantee of success or [...].

There have been many studies of the relationship between interest
and ability.* Most of these have related to aspects of [...]. In
general, the relationship between [...] science and the corresponding
interest (i.e., [...] on the Kuder) is positive but low. [...]
interest in the corresponding area will run about [...] to .50. Thus,
there is some slight tendency for those with high ability in a field
of knowledge to show high interest in it. But the relationship is too
low for either type of measure to serve in place of the other. Both
types of information are needed for any sound evaluation of an
individual's suitability for a particular program of study or plan for
[...].

Standardized interest inventories have been developed primarily for
their contribution to vocational counseling and job placement. With
this purpose in mind, they are directed at groups of [...] and older.
The Kuder, with its relatively general interest areas, has been used
satisfactorily at about the ninth grade and above. The Strong,
focusing on specific occupations and with a particular emphasis on
occupations at the professional level, is suitable primarily for high
school pupils with definite plans to go to college and for older
groups. As in almost all inventories, these instruments involve a good
deal of reading. Their use with individuals who fall below about a
ninth grade reading level would probably present serious problems.
Several other interest inventories are listed and briefly described in
Appendix III.


TEMPERAMENT AND ADJUSTMENT INVENTORIES 

Self-report inventories have been extensively developed in the areas
of temperament and personal adjustment. In these areas we again
encounter instruments developed to yield scores for internally
consistent clusters of behaviors, as did the Kuder Preference Record,
and instruments built with keys based on reference to some external
criterion, as was the Strong Vocational Interest Blank.

The basic material of all temperament and adjustment questionnaires
is much the same. They draw from an extensive [...] of statements
about actions and feelings. To these the individual responds by
indicating whether each is or is not characteristic of him. In many
cases, a “?” or “uncertain” category is provided for the person who
does not wish to endorse an unequivocal “Yes” or “No” answer. In the
case of adjustment questionnaires, questions are culled from case
studies, writings on various types of adjustment problems, suggestions
of psychiatrists, and similar sources. For the normal dimensions of
temperament, a review of psychological and literary treatments of
personality differences and a systematic scrutiny of previous
questionnaires, together with the personal insights of the
investigator, provide the raw material for assembling items.

* See Frandsen for a review of some of this material.

There are a large number of temperament and adjustment inventories.
We will describe three in some detail, illustrating distinctively
different patterns. These are the Guilford-Zimmerman Temperament
Survey, the Minnesota Multiphasic Personality Inventory (MMPI),
and the Edwards Personal Preference Schedule (EPPS). Then we
will undertake a more general evaluation of the validity of inventories
in this area and of the conditions under which we may expect them to
be of value.

THE GUILFORD-ZIMMERMAN TEMPERAMENT SURVEY
The Guilford-Zimmerman Temperament Survey is the most recent
development in a series of instruments on which Guilford has worked,
each of which has attempted to identify and measure a number of
internally coherent dimensions of personality that are clearly distinct
from one another. Guilford has started with a pool of items and
studied the intercorrelations among them, using the methods of factor
analysis to which we referred on p. 262. He has identified distinct
personality factors or foci, and tried to build up clusters of items to
measure each. The objective is to get separate scales that are
internally coherent and that are relatively independent of other
scales. Thus, if a factor of “sociability” is identified, one attempts
to get a cluster of items focussing on “sociability” that correlate
substantially with each other, so that the person who subscribed to
one item is likely also to subscribe to others. This cluster should be
quite independent of other clusters relating to “dominance,”
“impulsiveness,” and so forth, so that the correlations between the
different clusters are quite low. This is the same basic approach as
the one we saw in the Kuder Preference Record.
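
The two requirements stated above, substantial correlation within a cluster and low correlation between clusters, can be checked on a small scale with the following Python sketch. The four items and the eight sets of answers are invented, and simple averaging of item intercorrelations is used here in place of factor analysis proper, which is a considerably more elaborate procedure.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical Yes (1) / No (0) answers of eight people to four items.
items = {
    "sociability_1": [1, 1, 1, 1, 0, 0, 0, 0],
    "sociability_2": [1, 1, 1, 0, 0, 1, 0, 0],
    "dominance_1":   [1, 0, 1, 0, 1, 0, 1, 0],
    "dominance_2":   [1, 0, 1, 0, 1, 0, 0, 1],
}

def cluster_of(item_name):
    """The cluster label is the part of the name before the final underscore."""
    return item_name.rsplit("_", 1)[0]

within, between = [], []
for a, b in combinations(items, 2):
    r = pearson(items[a], items[b])
    if cluster_of(a) == cluster_of(b):
        within.append(r)
    else:
        between.append(r)

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print(round(mean_within, 2), round(mean_between, 2))
```

In these made-up data the items of a cluster correlate .50 with each other and not at all with the items of the other cluster, which is the pattern the scale builder is hoping to find.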

The Guilford-Zimmerman inventory provides scores appraising the
clusters named and characterized below. Each cluster is characterized
both by descriptive phrases and by two illustrative items.





General Activity. A high score indicates rapid pace of activities;
energy, vitality; keeping in motion; production, efficiency; liking for
speed; hurrying; quickness of action; enthusiasm, liveliness.

Sample Items

You start to work on a new project with a great deal of enthusiasm.
(+)

You are the kind of person who is “on the go” all of the time. (+)

Restraint. A high score indicates serious-mindedness; deliberateness;
persistent effort; self-control; not being happy-go-lucky or carefree;
not seeking excitement.

Sample Items

You like to play practical jokes upon others. (-)

You sometimes find yourself “crossing bridges before you come to
them.” (+)

Ascendance. A high score indicates habits of leadership; a tendency
to take the initiative in speaking with others; liking for speaking in
public; liking for persuading others; liking for being conspicuous;
tendency to bluff; tendency to be self-defensive.

Sample Items

You can think of a good excuse when you need one.

You avoid arguing over a price with a clerk or salesman. (-)

Sociability. A high score indicates one who has many friends and
acquaintances; who seeks social contacts; who likes social activities;
who likes the limelight; who enters into conversations; who is not shy.

Sample Items

You would dislike very much to work alone in some isolated place.
(+)

Shyness keeps you from being as popular as you should be. (-)

Emotional Stability. A person with a high score shows evenness of
moods, interests, etc.; optimism, cheerfulness; composure; feelings of
being in good health; freedom from feelings of guilt, worry, or
loneliness; freedom from day dreaming; freedom from perseveration of
ideas and moods.

Sample Items

You sometimes feel “just miserable” for no good reason at all.

You seldom give your past mistakes a second thought. (+)




Objectivity. The high scorer is defined as free from the following:
egoism, self-centeredness; suspiciousness, fancying hostility; ideas of
reference; a tendency to get into trouble; a tendency to be
thin-skinned.

Sample Items

You nearly always receive all the credit that is coming to you for
things you do. (+)

There are times when it seems everyone is against you. (-)

Friendliness. High scores signify respect for others; acceptance of
domination; toleration of hostile action; freedom from hostility,
resentment, or desire to dominate.

Sample Items

When you resent the actions of anyone, you promptly tell him so.
(-)

You would like to tell certain people a thing or two. (-)

Thoughtfulness. The high-scoring person is characterized as
reflective, meditative; observing of his own behavior and that of
others; interested in thinking; philosophically inclined; mentally
poised.

Sample Items

You are frequently “lost in thought.” (+)

You find it very interesting to watch people to see what they will
do. (+)

Personal Relations. High scores signify tolerance of people; faith
in social institutions; freedom from self-pity or suspicion of others.

Sample Items

There are far too many useless laws that hamper an individual's
personal freedom. (-)

Nearly all people try to do the right thing when given a chance.
(+)

Masculinity. The high-scoring person is interested in masculine
activities; not easily disgusted; hard-boiled; inhibited in emotional
expression; resistant to fear; unconcerned about vermin; little
interested in clothes, style, or romance.

Sample Items

You can look at snakes without shuddering. (+)

The sight of ragged or soiled fingernails is repulsive to you.
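
The (+) and (-) signs attached to the sample items show the direction in which each item is keyed. A minimal Python sketch of how such keyed items might be turned into a scale score follows. It assumes, for illustration only, that a “Yes” to a (+) item or a “No” to a (-) item earns one point and that a “?” earns nothing; the published scoring rules may differ, and the function name scale_score is hypothetical.

```python
def scale_score(keys, answers):
    """Count responses given in the keyed direction.

    keys:    item text -> "+" or "-"
    answers: item text -> "Yes", "No", or "?"
    An unanswered item is treated like "?" and earns no credit.
    """
    points = 0
    for item, direction in keys.items():
        answer = answers.get(item, "?")
        if direction == "+" and answer == "Yes":
            points += 1
        elif direction == "-" and answer == "No":
            points += 1
    return points

# The two Restraint sample items quoted in the text.
restraint_keys = {
    "You like to play practical jokes upon others.": "-",
    "You sometimes find yourself crossing bridges before you come to them.": "+",
}

# A hypothetical respondent.
answers = {
    "You like to play practical jokes upon others.": "No",
    "You sometimes find yourself crossing bridges before you come to them.": "Yes",
}

restraint = scale_score(restraint_keys, answers)
print(restraint)
```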

Since each of these clusters can be thought of as a dimension having
two ends, just as we have north and south, east and west, there is an
opposite end of each dimension that can be [...]. The reverse of the
[...] characterize this opposite end. [...] people do not score at
either extreme on these dimensions. Here, as elsewhere, a continuous
range of variation with most people in an intermediate position is the
characteristic pattern. Most people are neither outstandingly active
nor conspicuously lethargic, neither [...] ascendant nor clearly
submissive. People can rarely be well [...] clear-cut personality
types. They are [...] described as showing different traits in varying
degree.

Using the names for the clusters [...] a problem, because the
clusters do not correspond [...] language labels we bring with us.
Each cluster is defined by the items that went into it and that were
grouped [...] went together in the responses of people [...]. The
titles are approximate. Each cluster can be [...] exactly only by a
close study of the items of which it is composed.

Table 12.5 * shows the reliabilities of the [...] and the
intercorrelations of the scores. The [...] about .80 and are adequate,
though not strikingly high. The [...], in developing this inventory,
was to identify a number of independent aspects of personality. This
means that the correlations of different scores should be low. They
tend to be. [...] the scores show rather substantial correlations.
[...] be directed to Ascendance and Sociability, Emotional Stability
and Objectivity, Friendliness and Personal Relations, and [...] and
Thoughtfulness. These pairs of scores are far from independent, and
the information provided by the scores is overlapping. In a [...] the
inventory is only partially efficient because of the duplication in
different scores. It is as if we were in part saying the same

* People who read about tests and testing will frequently have [...]
to study tables of correlations like Table 12.5. In the table the
column at the left lists the different variables and numbers them in
order. The numbers (but not the names) are repeated across the top of
the table. Look at the row labeled “1 General activity.” The numbers
that appear in this row are the correlations of “general activity”
with each of the other variables. The first entry, -.16, is the
correlation between “general activity” and variable 2, restraint. This
means that there is a slight tendency for high scores on the general
activity scale to go with low scores on the restraint scale. The next
figure, .34, is the correlation of “general activity” with
“ascendance,” and the other entries are to be read in the same way.
The correlation between any two variables is found in the row and
column whose numbers correspond to those variables. In this table, the
reliability coefficients for the variables are shown in a column at
the extreme right.



Table 12.5. Intercorrelations and Reliabilities of the Ten Scales of the
Guilford-Zimmerman Temperament Survey

[Table body illegible in the source. Rows and columns are the ten
scales: 1 General activity, 2 Restraint, 3 Ascendance, 4 Sociability,
5 Emotional stability, 6 Objectivity, 7 Friendliness, 8 Thoughtfulness,
9 Personal relations, 10 Masculinity.]


thing twice. In most cases, however, each score provides information
about a new and distinctive aspect of the [...].

The Guilford-Zimmerman inventory has several characteristics that
it may be well to summarize at this time.

1. It is based upon the responses of normal, everyday people, not of
[...].

2. Its scales were [...] by study of the “going together” of groups
[...].

3. Responses are taken at face value. Their significance is assumed
to be given by their obvious content.

4. The respondent may [...] he wishes; his choices are not forced or
[...].
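
For readers who like to see a procedure spelled out, the way of looking up an entry in a table of intercorrelations, by finding the row and column whose numbers correspond to the two variables, can be sketched in Python. All numerical values in the sketch are illustrative stand-ins and are not a transcription of any published table; the names upper, reliability, and correlation are likewise hypothetical.

```python
# Variable numbers and names, as listed in the left-hand column of a table.
names = {1: "General activity", 2: "Restraint", 3: "Ascendance"}

# Only the upper triangle of the table is stored, keyed by
# (row number, column number) with row < column. Values are illustrative.
upper = {(1, 2): -0.16, (1, 3): 0.34, (2, 3): -0.08}

# Reliability coefficients, as in a column at the extreme right (illustrative).
reliability = {1: 0.79, 2: 0.80, 3: 0.80}

def correlation(i, j):
    """Correlation between variables numbered i and j, in either order."""
    if i == j:
        raise ValueError("no entry for a variable with itself; see reliability")
    return upper[(min(i, j), max(i, j))]

r = correlation(2, 1)
direction = "low" if r < 0 else "high"
print(f"High scores on {names[1]} tend to go with {direction} scores "
      f"on {names[2]} (r = {r}).")
```

Because the correlation of variable 1 with variable 2 is the same as that of variable 2 with variable 1, storing one triangle of the table is enough, which is also why printed tables usually leave the other triangle blank.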


THE MINNESOTA MULTIPHASIC PERSONALITY INVENTORY

[...] University of Minnesota Hospital [...] a number of different
groups with [...] to be definitely abnormal. We cannot [...] labels to
the variation in these traits that appears in normal individuals. The
interpretation of scores found for normal persons must be made with
great caution.

Hypochondriasis Scale (Hs). This scale assesses the amount of
abnormal or excessive concern with bodily [...]. A high score
indicates undue worry about health, often accompanied by [...] of
obscure pains and disorders that are difficult to identify.

Sample Items

I do not tire quickly.

The top of my head sometimes feels tender.

Depression Scale (D). This scale appraises a tendency to be
chronically depressed, to feel useless and unable to face the future.

Sample Items

I am easily awakened by noise. (+)

Everything is turning out just like the prophets of the Bible said it
would. (+)

Hysteria Scale (Hy). This scale gets at the tendency to [...]
personal problems by developing physical symptoms, such symptoms as
paralyses, cramps, gastric or intestinal complaints, or cardiac
symptoms. The symptoms tend to appear under emotional stress and to be
used as an escape mechanism.

Sample Items

I am likely not to speak to people until they speak to me. (+)

I get mad easily and then get over it soon. (+)

Psychopathic Deviate Scale (Pd). This scale was based upon a
group who showed absence of deep emotional response, [...] to
profit from experience, and disregard for social pressures and the
regard of others. They were individuals who, from their disregard of
social pressures, had tended to get into trouble of various sorts.

Sample Items

My family does not like the work I have chosen. (+)

What others think of me does not bother me. (+)

Paranoia Scale (Pa). The qualities evaluated by this scale are
suspiciousness, oversensitivity, and feeling of being picked on or
persecuted.

Sample Items

I am sure I am being talked about. (+)

Someone has control over my mind. (+)

Psychasthenic Scale (Pt). This scale was based on patients who
were troubled with excessive fears or with compulsive tendencies to
dwell on certain ideas or perform certain acts. High score indicates
resemblance to this group.

Sample Items

I easily become impatient with people. (+)

I wish I could be as happy as others seem to be. (+)

Schizophrenic Scale (Sc). This scale is based upon a group of
patients characterized by bizarre and unusual thought or behavior, and
a subjective life tending to be divorced from the world of reality.
High scores indicate responses similar to this group.

Sample Items

I have never been in love with anyone. (+)

I loved my mother. (-)

Hypomania Scale (Ma). This scale evaluates a tendency to be
overactive both bodily and mentally, with a tendency to skip around
rapidly from one thing to another.

Sample Items

I don't blame anyone for trying to grab everything he can get in this
world. (+)

When I get bored I like to stir up some excitement. (+)

Masculinity-Femininity Scale (Mf). This scale measures interests
characteristic of the one or the other sex.

Sample Items

I like movie love scenes. (F)

I used to keep a diary. (F)

The MMPI has a number of additional features, and these focus
attention on certain problems that arise in using adjustment
questionnaires. The first of these features is a lie scale (L). This
is based upon fifteen items, imbedded in the questionnaire, that relate
to socially approved and virtuous activities that are generally
approved of but not frequently carried out. General population norms
indicate what may reasonably be expected on a set of items of this
sort. If a person marks an excessive number of these socially approved
behaviors, it is considered to be an indication that he tends,
consciously or unconsciously, to distort his report so that he appears
in a favorable light. That is, he tends to “fake good.”

Another score, the K scale, was built up by keying items that
distinguished known abnormals who had presented normal score profiles
from a control group of normals. A high score on this scale is thought
to indicate a tendency to be very defensive in self-evaluation, whereas
a low score brings out the tendency to be extremely self-critical,
i.e., [...].

The ? score is based upon the number of ? or undecided responses.
A very high number is thought to indicate a tendency to evade the task
imposed by the inventory: to withdraw from it and fail to face up to
it.

One further control scale is the F scale, made up of an assortment
of unrelated items, each of which is marked as true only rarely in the
general population. A high score on this scale is thought to be
symptomatic of careless and superficial marking of the inventory: of
marking items at random or misunderstanding the statements.

Thus, the authors of the MMPI have introduced a whole series of
control scales, designed to isolate individuals whose responses are
untrustworthy for one of several different reasons. They recognize,
first, that good adjustment (and also bad adjustment) can be faked
with at least partial success and that before an attempt is made to
interpret scores on an inventory some guarantee is needed that there
was not intentional faking. They recognize also that quite
unintentionally individuals differ in the severity of the standards by
which they judge themselves and that some control is needed on this
difference in severity of standards. They recognize unwillingness to
cooperate and inability to comprehend the task or to read the written
items, which may show up as superficial and meaningless patterns of
responses. All of these issues represent real problems to users of an
inventory, and the control scores represent one well-conceived attempt
to identify untrustworthy answer sheets.
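
How a set of control scores might be used to screen answer sheets can be sketched in Python. The cutoff values below are purely hypothetical and are not taken from any manual; in practice they would be read from published norms. The function name screen_answer_sheet and the wording of the flags are likewise inventions for the example.

```python
def screen_answer_sheet(scores, cutoffs):
    """Return a list of reasons for treating an answer sheet with caution.

    scores:  raw scores on the control scales "L", "F", "K", and "?"
    cutoffs: hypothetical thresholds for each kind of warning
    """
    flags = []
    if scores["L"] >= cutoffs["L"]:
        flags.append("L high: may be presenting himself in a favorable light")
    if scores["F"] >= cutoffs["F"]:
        flags.append("F high: careless or random marking, or misunderstanding")
    if scores["K"] >= cutoffs["K_high"]:
        flags.append("K high: defensive in self-evaluation")
    elif scores["K"] <= cutoffs["K_low"]:
        flags.append("K low: extremely self-critical")
    if scores["?"] >= cutoffs["?"]:
        flags.append("? high: evading the task")
    return flags

# Hypothetical cutoffs and one hypothetical answer sheet.
cutoffs = {"L": 8, "F": 12, "K_high": 20, "K_low": 6, "?": 30}
sheet = {"L": 10, "F": 3, "K": 5, "?": 40}

warnings = screen_answer_sheet(sheet, cutoffs)
for line in warnings:
    print(line)
```

A sheet that raises no flag is not thereby proved trustworthy; the control scores only pick out some of the more obvious kinds of distortion.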

In contrast with the Guilford-Zimmerman, we note that the MMPI:

1. Is based upon the distinctive responses of selected groups of
persons, in this case groups each presenting a particular
psychopathology.

2. Has scales that are defined by these abnormal groups.

3. Is not concerned with the apparent meaning of an item, but only
with whether it functions, that is, whether it serves to differentiate
between the abnormal and the control group.

It thus follows the general pattern of the Strong Vocational Interest
Blank. In common with the Guilford-Zimmerman, it permits any number of
items to be endorsed, leaving the respondent free of constraint in
this regard.

Let us look now at an inventory that makes use of the forced-choice
pattern of response.

THE EDWARDS PERSONAL PREFERENCE SCHEDULE
The Edwards Personal Preference Schedule tries to assess the
strength of various needs or motives in the life economy of the
individual. Fifteen needs were selected from among those listed by
Murray, and items were developed to exemplify each. These are presented
to the individual in pairs, each need being paired twice with each of
the 14 others (to make a total of 210 items). Sample pairs are:

A  I like to help my friends when they are in trouble.
B  I like to do my very best in whatever I undertake.

A  I like to conform to custom and to avoid doing things that people
   I respect might consider unconventional.
B  I like to talk about my achievements.

The examinee must respond to each pair by indicating which statement
is more true or more characteristic of him. Knowing how many times
(out of 28) the examinee chose the option referring to achievement,
for example, the examiner can refer to the norms and express
need-achievement as a percentile of the norm group. Edwards made a
systematic attempt to equate the statements in a given pair for social
desirability, so that individuals would respond as they really felt,
and not in terms of what is the approved or accepted thing to say. This
was one way of trying to free scores of the element of defensiveness
or “faking good” that has been a problem in many of the inventories
that have been developed over the years.
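
The item counts quoted for the schedule follow directly from the pairing scheme, as a short Python check shows. The sketch assumes nothing beyond what is stated in the text (fifteen needs, each paired twice with each of the other fourteen); the variable names are arbitrary.

```python
from itertools import combinations

needs = 15

# Every unordered pairing of two different needs.
pairings = list(combinations(range(needs), 2))

# Each pairing is presented twice.
total_items = len(pairings) * 2

# Number of items in which any one need (here, need 0) appears.
appearances = sum(2 for pairing in pairings if 0 in pairing)

print(total_items, appearances)  # prints: 210 28
```

Since exactly one statement is chosen in each of the 210 items, the fifteen need scores of every examinee must add to 210, which is another way of seeing that no one can be high on all scales at once.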

The distinctive features of the EPPS are, then,

1. The “forced choice” pattern, which means that each respondent
must make the same number of choices and the same number of
rejections. Thus, no profile can be high on all scales, and each
profile must have about the same number of highs and of lows. Everyone
is brought to the same general base line. This is true also of the
triads of the Kuder.

2. Equating “social desirability,” so that any pressure or incentive
to distort responses or “fake good” is held to a minimum.


PROBLEM CHECK LISTS

The instruments we have just been describing yield several scores,
representing traits or aspects of [...]. [...] several recently
published problem check lists [...] catalogues of problems that are
fairly [...] people. Examples are the [...] List, the SRA Junior
Inventory and Youth Inventory, and the [...] Problems Inventory.
Responding to the comprehensive list of problems provides a kind of
uniform problem-finding interview. [...] a child marks as matters of
concern to him [...] starting point for more intensive inquiry in a
face-to-face interview. Problems that are marked as troublesome by
several in a [...] can serve as the focal point for group guidance
sessions.

EVALUATION OF TEMPERAMENT AND ADJUSTMENT INVENTORIES
How well can we hope to describe temperamental characteristics
and personal adjustment through the individual's response to a [...]
of questions? Perhaps we can clarify the issue by asking what a person
must do to fill out an inventory adequately. Completing one of these
inventories usually requires that the respondent be (a) able to read
and understand the item, (b) able to stand back and [...] his own
behavior and decide whether the statement is or is not [...], and
(c) willing to give frank and honest answers. Each of these points
raises certain issues about the validity of self-report inventories.

One problem in inventories of all types is that of reading. This
problem is partly one of sheer amount of reading. Especially in those
inventories that try to appraise several different traits, it is
usually necessary to have several hundred items to provide enough
[...] reliability. The slow reader may have trouble getting through so
much verbiage, or may start responding without really reading through
the item. The problem is partly one of level of reading, i.e., of the
complexity of structure and abstractness of ideas involved. If the
vocabulary or concepts are beyond the respondent's comprehension, he
may again give up the attempt really to understand and may respond in
a superficial or random fashion. (The F scale of the MMPI was designed
to protect against this hazard.) Thus, inventories are of questionable
value for those of low literacy, be they adults or children.

A second problem is that of self-insight. Inventories require the
individual to conceptualize and classify his own behavior, to decide
whether certain descriptions or classifications of behavior are true of
him. This implies a certain ability to stand back from himself and
view himself objectively that may be difficult to achieve. In fact, the
person whose adjustment is most unsatisfactory may be the one who
is least able to achieve this objectivity and to face his own
deficiencies. Studies have shown repeatedly that those who are rated
low by their associates on some desirable trait tend to grossly
over-rate themselves. Thus, the ill-tempered girl is likely not to
recognize her own irascibility; the overbearing boy may be unaware of
his boorishness.

When inventories are built according to the pattern of the Strong
VIB or the MMPI, such a lack of self-insight may not be of crucial
importance. For these inventories, the keying of an item is based not
on its obvious content but on the empirical fact that it did distinguish
between criterion groups. If Henry has marked that he would like to
be an architect, he has behaved in the way engineers typically behave.
The question of whether engineers on the one hand or Henry on the
other really want to be architects is not central to our interpretation.
The point is that they have both reacted to the question in the same
way, so we give Henry a credit on the engineer key of the Strong. On
the other hand, where items and scores are interpreted on the basis of
their manifest content and taken at face value, as is true of the
Guilford-Zimmerman inventory or the Kuder Preference Record,
non-insightful responses will lead to an untrue picture of the person
who makes them.


A third problem is the willingness of the respondent to reveal the 
way he perceives or feels about himself. For personality inventories, 
frank and honest response by the examinee is essential for a valid pic- 
ture. In most cases, the general significance of the items is reasonably 
apparent to the reader. Most subjects can follow successfully instruc- 
tions to fake in a particular way. Even when the subject cannot fake 
successfully, if he tries to do so he will certainly give a distorted picture 
of himself. Inventory scores will only be useful when most respondents
are answering in the way that they consider to represent themselves.
The importance of providing protection against distortion is
sufficiently great so that control scores to detect it have been introduced 
into the MMPI and certain other inventories. 

This means that personality or adjustment inventories cannot be
used, or can be used only with caution, when the examinee feels threatened
by the test or feels that it may be used against him. Inventories
have not generally proved useful in an employment situation, perhaps
for this reason. If an inventory is given to elementary school pupils
(and perhaps in high school and college) in the typical school setting,
in which a test is something to do your best on and the teacher is often
someone to get the best of, one is inclined to doubt whether many of
the pupils will be willing to reveal personal shortcomings that they may
be aware of. Generally speaking, in any practical situation we should 
consider an adjustment inventory to be no more than a preliminary 
screening device that will locate a group of individuals who may be 
having problems of adjustment or may be in conflict with their envi- 
ronment. Final evaluation should always await a more personal and 
intensive study of the individual. Furthermore, a good score on an 
adjustment inventory is not a guarantee of good adjustment; it may 
characterize a person who is protective, defensive, or unable to face 
and to acknowledge very real problems. 

Personality inventories are a product of the middle-class American 
culture. The extent to which items have equivalent meaning for other
national cultures, or even for the lower socio-economic level in Amer- 
ica, has not been fully explored. Some additional caution is necessary 
in interpreting results for members of other cultural groups. 

Evidences of Validity. Those inventories that have been developed 
as measures of adjustment usually show a moderate level of concurrent
validity. That is, they differentiate between groups established on
other grounds as differing in adjustment. Thus, the MMPI was set up
to distinguish between diagnosed pathological groups and normals and 
continues to do so in new groups. Other inventories have been tested 
by their ability to differentiate less extreme groups and have stood up 
fairly well under the test. 

When it comes to predictive validity, the results are less encouraging.
In civilian studies, inventory scores have generally failed
to predict anything much about the future success of the individual
either in school, on the job, or in his personal living. Military experience
with these instruments has been somewhat more promising.
There have been a number of studies showing substantial relationship
between scores based on inventories and the subsequent judgment 
resulting from a psychiatric interview. Relationships to subsequent 
discharge from the service have also been sufficiently good to indicate 
that an inventory could serve a useful function as a device to screen 
for careful interview those who appeared to be potential misfits. 

The Practical Use of Temperament and Adjustment Inventories. 
We must now ask what use should be made of temperament and ad- 
justment inventories in and out of school. In the light of the factors 
that can distort scores and the limited validity these instruments have 
shown as predictors, we must conclude that they should be used very 
sparingly. Our feeling is that an adjustment inventory should be used 
only as an adjunct to more intensive psychological services. If facilities
are available to permit intensive study of some of the group by
psychologically trained personnel, an inventory may serve as a means 
of identifying persons likely to profit from working with a counselor. 
However, there is little that a classroom teacher can do to dig behind 
and test the meaning of an inventory score. Accepted uncritically,
the score may prove very misleading. We believe that little useful purpose
is served by giving an adjustment inventory and making the results
available to the teacher, especially the teacher of an elementary-school
child.

The multi-dimensional temperament inventories are still too new
for us to have much evidence on the social or practical validities of the
different scales. Their use for vocational guidance or personnel selection
can hardly be recommended at the present time. It may be that
persons having certain patterns of temperamental characteristics should
be guided towards or away from certain types of jobs. This seems
plausible to many people. However, our information about the per- 
sonality patterns in specific occupations is too limited, and the range of 
variation within occupations is probably too wide to make much prac- 
tical use of such personality appraisals at the present time. 


ATTITUDE QUESTIONNAIRES 

One further type of self-report inventory deserves brief mention.
This is the attitude questionnaire, designed to appraise an individual's
favorableness toward some group, proposed action, social institution,
or social concept. Opinion polling has become commonplace in the
last 25 years. However, this involves attitude measurement in only
the most rudimentary sense. One or more questions are asked, and
a count is made of the frequency of responses in two or three broad
categories. Polls of this sort may be used in the schools to get an appraisal
of public opinion of the school's patrons, or to study the status
or change of pupils' expressed beliefs after instruction. The schools
express a good deal of concern about development of attitudes and
ideals as educational objectives, so there is need for devices to
appraise the extent to which such outcomes are being achieved. Industrial
morale surveys are another point at which practical use may be
made of attitude measurement. But the greatest use of attitude appraisal
devices up to the present time has probably been for
studies of factors related to attitude differences, of factors
that produce changes in attitude, or of the influence of attitudes upon our
perception of our world.



The typical attitude questionnaire is made up of a set of
items that the individual may either endorse or reject. There are
two main patterns:

1. Scaled Statements. In this form, statements are scaled in terms
of their degree of favorableness on the basis of extensive preliminary
work. Thus, if we are preparing this type of attitude scale toward the
United Nations, we start with a large pool of items. They may include
the following:


The UN is a strong influence for peace. 

The UN is a waste of time and effort. 

The UN does about as much harm as good. 

The UN is the most important force for good in 

A corps of judges is assembled and each judge is asked to sort these
statements into a set of piles, each pile representing a degree
of favorableness toward the UN. The judge is not expressing his own
agreement or disagreement with the statement; he is giving his judgment
of its meaning and significance. Each statement receives a scale value
based on the average of these judgments and an ambiguity index based
upon the spread of the ratings. (The more the judgments spread out,
the more ambiguous the statement is.) From the pool of items tried
out, about twenty are chosen that spread out over the range of scale
values and are relatively unambiguous. These constitute the attitude
scale.

When this type of attitude scale is administered, the respondent
marks all the statements with which he agrees. His score is the average
of the scale values of the statements he endorses.
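
The two steps just described, scaling the statements from the judges' sortings and then scoring a respondent, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pile numbers are invented, an eleven-pile sort is assumed, and the mean and the standard deviation stand in for "the average of these judgments" and "the spread of the ratings", since the text does not say which statistics are used.

```python
from statistics import mean, pstdev

# Hypothetical pile placements (1 = least favorable, 11 = most favorable)
# assigned by five judges to each statement. All numbers are invented.
judge_piles = {
    "The UN is a strong influence for peace.": [9, 10, 9, 8, 9],
    "The UN is a waste of time and effort.": [2, 1, 2, 2, 3],
    "The UN does about as much harm as good.": [6, 5, 6, 7, 6],
}

def scale_statement(piles):
    """Return (scale value, ambiguity index) for one statement.

    Assumption: scale value = mean pile, ambiguity = standard deviation.
    """
    return mean(piles), pstdev(piles)

# Scale value and ambiguity index for every statement in the pool.
scaled = {text: scale_statement(piles) for text, piles in judge_piles.items()}

def respondent_score(endorsed):
    """Average the scale values of the statements the respondent endorses."""
    return mean(scaled[text][0] for text in endorsed)

score = respondent_score([
    "The UN is a strong influence for peace.",
    "The UN does about as much harm as good.",
])
print(score)
```

In a fuller version, statements with a large ambiguity index would be dropped before the final set of about twenty is chosen.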

2. Summed Score. In the other common format, the basic statements
are much the same, except that neutral statements are omitted.
Each statement is unequivocally either favorable or unfavorable. The
respondent reacts to each statement on a five-point scale, ranging from
strong agreement to strong disagreement. Thus, a section of a questionnaire
in this format might read:


The UN is a strong       Strongly   Agree   Uncertain   Disagree   Strongly
influence for peace.     agree                                     disagree

The UN will only         Strongly   Agree   Uncertain   Disagree   Strongly
make trouble.            agree                                     disagree


The questionnaire can be scored quite simply by giving five points for
strong endorsement of a favorable statement, four points for agreement,
three points for uncertainty, and so forth. The scoring is reversed
for the unfavorable statements. An individual's raw score is
the sum of his scores for the separate items. The raw score can, of
course, be converted into a percentile or standard score if this seems
desirable.
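
The scoring rule just stated is simple enough to write out directly. The sketch below is illustrative only; the function names and the sample responses are hypothetical, and the reversal for unfavorable statements is done by subtracting the points from six, which turns 5 into 1, 4 into 2, and so on.

```python
# Points for a favorable statement, as described in the text.
POINTS = {
    "strongly agree": 5,
    "agree": 4,
    "uncertain": 3,
    "disagree": 2,
    "strongly disagree": 1,
}

def item_score(response, favorable):
    """Score one response; scoring is reversed for unfavorable statements."""
    points = POINTS[response.lower()]
    return points if favorable else 6 - points

def raw_score(responses, keyed_favorable):
    """Sum the item scores; keyed_favorable flags each statement's direction."""
    return sum(item_score(r, f) for r, f in zip(responses, keyed_favorable))

# Hypothetical respondent answering the two sample items shown above:
# "strong influence for peace" is favorable, "will only make trouble" is not.
total = raw_score(["Agree", "Strongly disagree"], [True, False])
print(total)  # 4 + 5 = 9
```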

Both forms of attitude scale usually have satisfactory reliabilities,
typically in the .80's. The two types of scales yield scores that
intercorrelate very highly, and for most practical purposes there does not
seem to be a great deal of choice between them. The greater simplicity
of preparation of a summed-score type of inventory will commend
it to most persons who wish to use an attitude scale as an aspect
of some type of educational evaluation or research project. In either
case, the scale will yield only a single general favorableness-unfavorableness
score for an attitude area. Any qualitative variations within
the broad area are blurred. Recent investigations on attitude scale
development have been concerned with identifying more restricted and
more homogeneous subscales within a larger attitude domain. A series
of homogeneous subscales within a larger attitude area (toward the
UN, for example) should permit mapping out in a more analytic and
diagnostic way the profile of an individual's or a group's attitudes.

The big qualification about attitude scales is that they operate purely
on a verbal level. The individual doesn't do anything to back up his
stated attitude. The scales deal with verbalized attitudes rather than
actions. Of course, an attitude scale is obviously fakeable. If we
recognize that they represent the verbalized attitude that the individual
is willing to express to us and work within that limitation, attitude
scales appear to be a useful research tool or tool for experimental
evaluation of educational objectives lying outside the domain of knowledges
and skills.

SUMMARY AND EVALUATION 

In this chapter we have considered self-report inventories as instru- 
ments for studying personality. An inventory of this sort is essentially 
a standard set of interview questions presented in written form. 

The individual's report about himself has one outstanding advantage.
It provides an "inside" view, based on all the individual's experience
with and knowledge about himself. However, self-reports
are limited by the individual's limited

1. Ability to read the questions with understanding.

2. Self-insight and self-understanding.

3. Willingness to reveal himself frankly.



One type of questionnaire that has proven valuable in selection and 
placement is the biographical data blank, in which the individual pro- 
vides factual information about his past history. A scoring key devel- 
oped for the particular job has been found to have useful validity in 
several different instances. 

Interest inventories provide satisfactorily reliable descriptions of in- 
terest patterns. These patterns persist with a good deal of stability, at 
least after late adolescence, and appear to be significant factors for 
vocational planning. 

The validity of adjustment and temperament inventories is more 
open to question. Inventories of all types can be distorted to some 
extent if the individual is motivated to distort his responses. Thus, 
the integrity of the responses depends upon the motivation of the per- 
son examined. This depends, in turn, upon the setting in which and 
purposes for which the inventory is used. In school, industrial, or 
military use of adjustment inventories, one suspects that the motiva- 
tions may often favor distorted responses. In any event, inventories 
of this type have not generally shown high validity. They should be 
used only with a good deal of circumspection. 

Attitude questionnaires have been developed to score the intensity
of favorable or unfavorable reaction to some group, institution, or 
issue. Though these represent only verbal expressions of attitude, they 
are useful research tools. 


REFERENCES 

1. Ellis, A., Recent research with personality inventories, J. consult.
Psychol., 17, 1953, 45-49.

2. Ellis, A., The validity of personality questionnaires, Psychol. Bull.,
43, 1946, 385-440.

3. Ellis, A., and H. S. Conrad, The validity of personality inventories in
military practice, Psychol. Bull., 45, 1948, 385-426.

4. Fear, Richard, The evaluation interview: prediction of job performance
in business and industry, New York, McGraw-Hill, 1958.

5. Frandsen, A., Interests and general educational development, J. appl.
Psychol., 31, 1947, 57-65.

6. Garry, R., Individual differences in ability to fake vocational interests,
J. appl. Psychol., 37, 1953, 33-37.

7. Ghiselli, E. E., and R. P. Barthol, The validity of personality inventories
in the selection of employees, J. appl. Psychol., 37, 1953, 18-20.

8. Longstaff, H. P., Fakability of the Strong Interest Blank and the Kuder
Preference Record, J. appl. Psychol., 32, 1948, 360-369.

9. Mallinson, G. G., and W. M. Crumrine, An investigation of the stability
of interests of high school students, J. educ. Res., 45, 1952, 369-383.

10. McCully, C. Harold, The validity of the Kuder Preference Record,
1954, George Washington University, Washington, D. C.

11. Murray, H. A., et al., Explorations in personality, New York, Oxford
University Press, 1938.

12. Rosenberg, N., Stability and maturation of Kuder interest patterns during
high school, Educ. psychol. Meas., 13, 1953, 449-458.

13. Strong, E. K., Interest scores while in college of occupations engaged
in 20 years later, Educ. psychol. Meas., 11, 1951, 335-348.

14. Strong, E. K., Nineteen-year followup of engineer interests, J. appl.
Psychol., 36, 1952, 65-74.

15. Strong, E. K., Permanence of interest scores over 22 years, J. appl.
Psychol., 35, 1951, 89-91.

16. Strong, E. K., Vocational interests of men and women, Stanford, Calif.,
Stanford University Press, 1943.

17. Super, D. E., The Bernreuter Personality Inventory; a review of research,
Psychol. Bull., 39, 1942, 94-125.


SUGGESTED ADDITIONAL READING

Allen, Robert M., Personality assessment procedures, New York, Harper,
1958, Chapters 2-7.

Bass, Bernard M., and Irwin A. Berg, Editors, Objective approaches to
personality assessment, New York, Van Nostrand, Chapters 1, 3, 5 and 6.

Cronbach, Lee J., Essentials of psychological testing, 2nd ed., New York,
Harper, 1960, Chapters 14-16.

Darley, John G., and Theda Hagenah, Vocational interest measurement,
Minneapolis, The University of Minnesota Press, 1955, Chapters 2, 4, 6.

Harris, Chester W., Editor, Encyclopedia of educational research, 3rd ed.,
New York, Macmillan, 1960, pp. 102-112, 728-732.

Guilford, J. P., Personality, New York, McGraw-Hill, 1959, Chapters 8-9.

Kuder, Frederic G., Kuder preference record occupational, Form D, research
handbook, 2nd ed., Chicago, Science Research Associates, 1957.

Layton, Wilbur L., Editor, The Strong vocational interest blank: research
and uses, Minneapolis, University of Minnesota Press, 1960.

QUESTIONS FOR DISCUSSION 

1. How satisfactory is the method that was used in validating the Strong
Vocational Interest Blank? What limitations do the procedures have? In
what ways should they be checked?

2. What are the relative advantages of the Strong Vocational Interest
Blank and the Kuder Preference Record? Under what circumstances
would you choose to use one and under what circumstances, the other?

3. What is the relationship between measures of interest and measures
of ability? What does this suggest as to the ways in which the two types of
tests should be used?

4. Most civilian studies have failed to find interest or adjustment inventories
very useful in personnel selection. What are the reasons for this?

5. Why are most published interest inventories intended for use with
secondary-school pupils, college students, and adults rather than
elementary-school students?

6. What uses could a classroom teacher make of results on the Kuder
Interest Inventory other than in giving vocational and educational guidance?

7. In what ways could a biographical data blank help a teacher in
understanding the pupils in a class? What types of information would be
useful to include on such a blank?

8. What conditions must be met if a self-report inventory is to be filled
out accurately and give meaningful results?

9. How much trust can we place in adjustment inventories given in
school to elementary-school children? What factors limit their value?

10. What important differences do you notice between the Guilford-Zimmerman
Temperament Survey and the Minnesota Multiphasic Personality
Inventory? For what purposes would each be more suitable?

11. What purposes are served by the control scales (L, K, F, ?) on the
Minnesota Multiphasic Personality Inventory? What would be the comparable
issues in personality rating scales? How might one adapt the ideas
of control scales to ratings?

12. What factors limit the usefulness of paper-and-pencil attitude scales?
What other methods might a teacher use to evaluate attitudes?

13. Prepare the rough draft for a brief attitude scale to measure teachers'
attitudes towards objective tests.

14. With what kinds of groups can adjustment inventories be used most 
satisfactorily? 



Chapter 13 


The Individual as Others See 
Him 


In the last chapter we considered the information about personality
that could be gotten from inventories in which the individual describes
himself. A second main way in which an individual's personality 
shows itself is through the impression he makes upon others. The 
second person serves as a reagent reacting to the first personality. 
How well does A like B? Does A consider B a pleasing person to 
have around? An effective worker? A good job risk? Does A con- 
sider B to be conscientious? Trustworthy? Emotionally stable? 
Questions of this sort are continually being asked of every teacher, 
supervisor, former employer, minister, or even friend. We must now 
inquire how fruitful it is to raise such questions and what precautions 
must be observed if the questions are to receive useful answers. 

We shall first give brief consideration to the unstructured letter of 
recommendation. Then we shall examine rating scales and rating pro- 
cedures. Finally, we shall consider some special forms of rating:
nominating techniques and forced-choice rating procedures. 

LETTERS OF RECOMMENDATION 

The most fluid form for getting an impression of one person through 
the eyes of a second person is to invite the second person to talk or 
write to you about him. Such a communication could be obtained 
in any setting. However, the setting in which it most commonly does 
occur is when person A is a candidate for something: admission to a
school, a scholarship or fellowship, a job, membership in a club, or
a security clearance. He then furnishes the institution, placement
agency, or employer the names of people who know him well or know
him in a particular capacity, and that agency obtains statements about 
A from B and C, who know him. 

How useful and how informative is the material that is included in 
free, unstructured communications describing another person? Actually,
in spite of the vast numbers of recommendations written every
year, very little of a solid and factual nature is known about their
adequacy or the effectiveness with which they discharge their function.
Opinion covers the full gamut from a belief that a free and unconstrained
letter about an applicant is the best possible way to get an
evaluation of him to the conviction that letters of recommendation are
completely worthless, from a conviction that the letter of recommendation
is the core of any selection program to a feeling that the best
thing to do with recommendations is to burn them. But factual studies
of the reliability and validity of the information that is gotten from a
letter of recommendation or of the extent to which recommendations
influence the action taken with respect to an applicant are fragmentary
in the extreme.

The letter of recommendation is such an unstructured document that
it is very hard to study by sound research techniques. However, several
investigators have attempted to make analyses of the content of
the letters and to scale them with respect to the enthusiasm of the
endorsement they provided. A moderate degree of agreement has
been found between different letters written about the same person.
Within a group of applicants for jobs in secondary-school teaching
from one teacher-training institution the between-letters reliability
would be represented by a correlation of about .40. There was some
evidence in this same study that the letters of those who got the jobs
were a little higher on the enthusiasm scale than letters of applicants
who were not employed. However, another study failed to find any
difference between the terms used to describe job getters and other
applicants.
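
A between-letters reliability of this kind is an ordinary correlation coefficient computed across applicants, with the enthusiasm rating of one letter paired against the rating of a second letter about the same person. The sketch below shows the arithmetic, assuming the product-moment (Pearson) coefficient is the one meant; the ratings are invented for illustration and are not the data of the study cited.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical enthusiasm ratings (1 = lukewarm, 5 = glowing) for six
# applicants, each described in two letters by different writers.
first_letter = [5, 3, 4, 2, 4, 1]
second_letter = [4, 4, 2, 3, 5, 2]

reliability = pearson_r(first_letter, second_letter)
print(round(reliability, 2))
```

A value near 1 would mean the two writers rank the applicants almost identically; a value near .40 means they agree only moderately.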

The extent to which a letter of recommendation provides a valid 
appraisal of an individual and the extent to which it is accurately diag- 
nostic of outstanding points, strengths or weaknesses, is almost com- 
pletely unknown. However, we cannot be very sanguine. Most of 
the limitations that we shall presently discuss in connection with more 
structured rating scales apply with at least equal force to uncontrolled 
letters. In addition, each respondent is free to go off in whatever di- 
rection his fancy dictates, so that there is no core of content common 
to the different letters about a single person or to the letters dealing 
with different persons. One letter may deal with A's social charm;
a second, with B's integrity; and a third, with C's originality. On what
common base are we to compare the three? Add to this the facts
that (1) the applicant usually is more or less free to select the persons
who will write about him and may be expected to pick those who will
support him and that (2) recommenders differ profoundly in their propensity
for using superlatives, and the prospect is not a very rosy one.




Further research studies of the validity of free descriptions of one
person by his fellows are urgently needed. In the meantime, recommendations
will continue to be written, and perhaps to be used. We
must turn our attention to more structured evaluation procedures.


RATING SCALES

Undoubtedly it was in part the extreme subjectivity of the unstruc- 
tured statement, the lack of a common core of content or standard of 
reference from person to person, and the extraordinary difficulty of 
quantifying the materials that gave impetus to the development of rating
scales. Rating procedures attempt to overcome just these deficiencies.
They attempt to get appraisals on a common set of attributes
for all raters and ratees and to have these expressed on a common
quantitative scale. 

We all have had experience with ratings, either in making them or
in having them made about us or, more probably, in both capacities.
Rating scales appear in a large proportion of school report cards, more
clearly in the non-academic part. Thus, we often find a section
phrased somewhat as follows:

                 1st       2nd       3rd       4th
                 Period    Period    Period    Period

Effort
Conduct
Citizenship
Cooperation
Adjustment

H = superior     S = satisfactory     U = unsatisfactory

Many civil service agencies and industrial firms send rating forms out
to persons listed as references by job applicants, asking for evaluations
of the individual's "initiative," "originality," "enthusiasm," or "ability
to get along with people." These same companies or agencies often
require supervisors to give merit ratings of their employees, rating them
as "superior," "excellent," "very good," "good," "satisfactory," or
"unsatisfactory" on a variety of traits or in over-all usefulness. Colleges,
medical schools, fellowship programs, and still other agencies call for
ratings as a part of their selection procedure. Beyond these practical
operating uses, ratings have been involved in a great many research
projects. All in all, vast numbers of ratings are called for and given,
often reluctantly, in our country week by week and month by month.
Rating other people is a large-scale operation.

The most common pattern of rating procedure presents the rater
with a set of trait names, perhaps somewhat further defined, and a
range of numbers, adjectives, or descriptions that are to represent
levels or degrees of possession of the traits. He is called upon to rate
one or more persons on the trait or traits by assigning him or them the
number, letter, adjective, or description that is judged to fit best. Two
illustrations are given of rating scales, drawn from a program being
developed for evaluation of management personnel.* The first is one
of a series of trait ratings. This part of the evaluation instrument calls
for ratings of the following traits: job know-how, judgment, leadership,
ability to plan and organize, communication ability, initiative, dependability,
and human relations. For the trait of leadership, the rater is
instructed as shown below. The actual rating scale follows these
instructions.


LEADERSHIP 

Consider his ability to inspire confidence. How much respect does he command
as an individual, not merely because of his position? Do people look
to him for decisions? Is he afraid to "stick his neck out" for what he
believes? Does he have teamwork?

□ Completely lacking. Definitely a follower with equals. Does not try to
convince others that his way is best.

□ Tries to lead with some success, but has never achieved a strong
position. Is passive in directing his subordinates.

□ Good leader. People wait to hear what he has to say. Respected by
colleagues. People call for his opinion.

□ Exceptional leader. Able to take over and pull things into shape. People
seem to enjoy going along on his side. Is respected by subordinates and
colleagues.


An over-all summary rating is also called for, and this takes the 
form shown below. 


Please place a mark on the scale to best show the over-all rating 
of this man in his present position. 


Not meeting the     Fair, but needs     Satisfactory     Doing          Excellent
requirements        to improve          work             good job


* These have been made available through the courtesy of the Personnel
Department of Mack Trucks, Inc.





These are just two of a wide range of rating instruments.
We shall turn presently to some of the major variations in rating patterns.
Right now, however, let us consider some of the problems that
arise when we try to get a group of judges to make these appraisals.


PROBLEMS IN OBTAINING SOUND RATINGS

The problems in obtaining valid appraisals of an individual through 
ratings are of two main sorts. There are first the factors that limit the
rater's willingness to rate honestly and conscientiously, in accordance
with the instructions given to him. There are secondly the factors that 
limit his ability to rate consistently and correctly, even with the best 
of intentions. We shall need to consider each of these in turn. 

FACTORS AFFECTING THE RATER'S WILLINGNESS TO RATE CONSCIENTIOUSLY

When ratings are collected, it is commonly assumed that each rater
is trying his best to follow the instructions that have been given him,
and that any shortcomings in his ratings are due entirely to human fallibility
and ineptitude. However, this is not necessarily true. There
are at least two sets of circumstances that may impair the integrity
of a set of ratings: (1) the rater may be unwilling to take the trouble
that is called for by the appraisal procedure; and (2) the rater may
identify with the person rated to such an extent that he is unwilling to
make a rating that will hurt him. Each of these merits some elaboration.

Unwillingness to Take the Necessary Pains. At best, ratings are a
bother. Careful and thoughtful ratings are even more of a bother.

In some rating procedures the attempt is made to get away from subjective
impressions and superficial reaction by introducing elaborate
procedures and precautions into the rating enterprise. Thus, in one
attempt to improve efficiency rating procedures for Air Force officers,
an elaborate form was introduced that was to serve as a combined
observational record and rating form. Fifty-four specific critical behaviors
were described relating to officer efficiency. Scales were prepared
describing degrees of excellence in each type of behavior. The
accompanying instructions called upon raters to observe their ratees
for a period before the official ratings were to be given and to tally on
the rating form instances that had been observed of desirable and undesirable
acts within each of the behavior categories described on the
scale. After a year or two of use this form was discarded, in part at
least because of its complexity and because raters were not willing to
devote the time and thought that would have been required to maintain
the preliminary observational records on which the ratings were to be
based.

In a lesser degree, one suspects that perfunctoriness in carrying out
the operation of rating is a factor contributing to lowered effectiveness
in many rating programs. Particularly if the number of pupils or employees
to be rated is large, the task of preparing periodic ratings can
become a decidedly onerous one. Unless raters are really "sold" on
the importance of the ratings, the judgments are likely to be hurried
and superficial ones, given more with an eye on finishing the task
than with a concern for making accurate and analytical judgments.

Identification with the Persons Being Rated. Ratings are often
called for by some rather remote and impersonal agency. The Civil
Service Commission, the Military Personnel Division of a remote
Headquarters, the personnel director of a large company, or the central
administrative staff of a school system are all pretty far away from
the first line supervisor, the squadron commander, or the classroom
teacher. The rater is often closer to the persons being rated, the workers
in his office, the junior officers in his outfit, the pupils in his class,
than to the agency that requires the ratings to be made. One of the
first principles of supervision or leadership is that the good leader looks 
out for the needs and welfare of his followers or subordinates. Morale 
in an organization depends upon the conviction that the leader of the 
organization will take care of the members of the group. When rat- 
ings come along, “taking care of” becomes a matter of seeing to it that 
one’s own men fare as well as — or a little better than — those in com- 
peting groups. 

All this boils down to the fact that in some situations the rater is
more interested in providing a "break" for the people whom he is rating
and in seeing that they get at least as good treatment as other
groups than he is in providing accurate information for the using
agency. This situation is aggravated in many governmental and official
agencies by a policy of having the ratings public and requiring that
the rater discuss with the person being rated any unfavorable material
in the ratings. A further aggravation is produced by setting up administrative
rulings in which a minimum rating is specified as required
for promotion or pay increase. No wonder, then, that ratings tend to
climb or to pile up at a single scale point. Thus, in certain governmental
agencies during World War II the typical rating, accounting for a
very large proportion of the ratings given, was "excellent." "Very
good" became an expression of marked dissatisfaction, while a rating
of “satisfactory” was reserved for someone you would get rid of at 
the first opportunity. 




It is important to realize that a rater cannot always be depended 
upon to work wholeheartedly at giving valid ratings for the benefit of 
the using agency, that making ratings is usually a nuisance to him, and 
that he is often more committed to his own subordinates than to an 
outside agency. A rating program must be continuously “sold” and 
policed if it is to remain effective. And there are limits to the extent
to which even an active campaign can overcome a rater’s natural in- 
ertia and interest in his own little group. 


FACTORS AFFECTING THE RATER'S ABILITY TO RATE ACCURATELY


Even when a group of raters are presumably well motivated and 
doing their best to provide valid judgments, there are still a number 
of factors that operate to limit the validity of those judgments. These 
center around the lack of opportunity to observe, the covertness of the 
attribute, ambiguity of the quality to be observed, lack of a uniform 
standard of reference, and specific rater biases and idiosyncrasies. 

Opportunity to Observe the Person Rated. One factor that must
always be borne in mind as a consideration limiting the accuracy of
rating procedures is limited opportunity on the part of the rater to
observe the person being rated. Thus, the high-school teacher teaching
four or five different class groups of 30 pupils each and seeing many
pupils only in a class setting may be called upon to make judgments as
to the “initiative” or “flexibility” of these pupils. The college instructor
who has taught a class of 100 pupils will receive rating forms from
an employment agency or from the college administration asking for
similar judgments. The truth of the matter is that effective contact
with the person to be rated has probably been too limited to provide
any adequate basis for the judgment that is being requested. True,
the ratee has been physically in the presence of the rater for a good
many hours, possibly several hundred, but these have been very busy
hours, concerned primarily with other things than observing and forming
judgments about pupil A. Pupil A has had to compete with pupils
B, C, D, and on to Z and also with the primary concern with teaching
rather than judging.

In a civil service or industrial setting much the same thing is true. 
The primary concern is with getting the job done, and although in 
theory the supervisor has had a good deal of time to observe each
worker, in practice he has been busy with other things. We may be 
able to “sell” supervisors on the idea of devoting more of their energy 
to observing and evaluating the persons working for them, but there 
are very real limits to the amount of effort that can be withdrawn from 
a supervisor’s other functions to be applied to this one. 




We face not only the issue of general opportunity to observe, but 
also that of specific opportunity to observe a particular aspect of the 
individual’s personality. Any person sees another only in certain limited
contexts, in which only certain aspects of his behavior are displayed.
The teacher sees a child primarily in the classroom, the foreman
sees a workman primarily on the production line, and so forth.

We might question whether a teacher in a thoroughly conventional 
classroom has seen a child under circumstances which might be ex- 
pected to bring out much “initiative” or “originality.” The college in- 
structor who has taught largely through lectures is hardly well situated 
to rate a student’s “presence” or “ability to work with individuals.” 
The supervisor of a clerk doing routine work is poorly situated to appraise
“judgment.” Whenever ratings are proposed, either for research
purposes or as a basis for administrative actions, we should ask with
respect to each trait being rated: Has the rater had a chance to observe
these people in enough of the sorts of situations in which they could
be expected to show variations in this trait so that his ratings can be
expected to be meaningful? If the answer is “No,” we would be well
advised to abandon the ratings.

In this connection, it is worth while to point out that persons in different
roles may see quite different aspects of the person to be rated.
Her pupils see a teacher from quite a different vantage point than does
the principal. Classmates in Officer Candidate School have a different
view of the other potential officers than does the drill instructor. In
getting ratings of some aspect of an individual, it is always appropriate
to ask who has the best chance to see the relevant behavior displayed.
It would normally be to this source that we should go for our ratings.

Covertness of Trait Being Rated. If a trait is to be appraised by 
an outsider, someone other than the person being rated, it must show 
on the outside. It must be something that has its impact on the out- 
side world. Such characteristics as appearing at ease at social gather- 
ings, having a pleasant speaking voice, and participating actively in 
group projects are characteristics that are essentially social. They
appear in interaction with other persons and are directly observable. 
They are overt aspects of the person being appraised. By contrast, 
attributes such as “feeling of insecurity,” “self-sufficiency,” “tension,” 
or “loneliness” are inner personal qualities. They are private aspects 
of personality and can only be crudely inferred from what the person 
does. They are covert aspects of the individual. 

An attribute that is largely covert can be judged by the outsider only 
with great difficulty. Little of inner conflict or tension shows on the 
surface, and where it does show it is often in masquerade. Thus, a
child’s deep insecurity may express itself as aggression against other
pupils in one child, or as withdrawal into an inner world in another.
The insecurity is not a simple dimension of overt behavior. It is an
underlying dynamic factor that may break out in different ways in different
persons or even in the same person at different times. Only a
thorough knowledge of the individual, combined with a good deal of
psychological insight, makes it possible to infer from the overt behavior
the nature of his underlying covert dynamics.

One can see, then, that rating procedures will be relatively unsatisfactory
for the inner, covert aspects of the individual. Qualities that
depend upon very thorough understanding of a person plus wise inferences
from his behavior will be rated with low reliability and little
validity. Ratings have most chance of being accurate for those qualities
that show outwardly as a person interacts with other people, the
overt aspects. Experience has shown that these can be rated more reliably,
and one feels confident that they are rated more validly. The
validity lies in part in the fact that these social aspects of behavior have
their meaning and definition primarily in the effects of one person on
another.

Ambiguity of Meaning of Dimension to Be Rated. Many rating
forms call for ratings of quite broad and abstract traits. Thus, in our
illustration on p. 353 we included, among others, “citizenship” and
“adjustment.” These are neither more nor less vague and general than
the attributes included in other rating schedules. But what do we
mean by “citizenship” in an elementary-school pupil? By what actions
is “good citizenship” shown? Does it mean not marking up the walls?
Or not spitting on the floor? Or not pulling little girls’ hair? Or bringing
newspaper clippings to class? Or joining the Junior Red Cross?
Or staying after school to help the teacher clean up the room? What
does it mean? Probably no two raters would have just exactly the
same things in mind when they rated a group of pupils on “citizenship.”

Or consider “initiative,” “personality,” “supervisory ability,” “mental
flexibility,” “executive influence,” or “adaptability.” These are all
examples from rating scales in actual use. Though there is certainly
some core of uniformity in the meaning that each of these terms will
have for different raters, there is with equal certainty a good deal of
variability in meaning from one rater to another. In proportion as a
term becomes abstract, its meaning becomes variable from person
to person, and such qualities as those listed above are conspicuously
abstract.

The rating that a given child will receive for “citizenship” will, then,
depend upon what “citizenship” means to the rater. If it means to
rater A conforming to school regulations, he will rate certain children
high. If to rater B it means taking an active role in school projects,
the high ratings may go to quite different children. A first problem
in getting consistent ratings is to achieve consistency from rater to rater
in the meanings of the qualities being rated.

Uniform Standard of Reference. A great many rating schedules call 
for judgments of the persons being rated in some set of categories 
such as 

Outstanding, above average, average, below average, unsatisfactory.
Superior, good, fair, poor.
Best, good, average, fair, poor.
Outstanding, superior, better than satisfactory, satisfactory, unsatisfactory.
Superior, excellent, very good, good, satisfactory, unsatisfactory.

But how good is “good”? Is a person who is “good” in “judgment”
in the top tenth of the group with whom he is being compared? The
top quarter? The top half? Or is he just not one of the bottom tenth?
And what is the group with whom he is supposed to be compared? Is 
it all men of his age? All employees of the company? All men in 
his particular job? All men in his job with his length of experience? 
If the last, how is the rater supposed to know the level of judgment that 
is typical for men in a particular job with a particular level of experi- 
ence? 

The problem that all these questions are pointing up is that of form- 
ing a standard against which to appraise a given ratee. Variations in 
interpretation of terms and labels, variations in definition of the refer- 
ence population, and variations in experience with the members of that 
background population all contribute to variability from rater to rater 
in their standards of rating. The phenomenon is a familiar one in 
academic grading practices. Practically every school that has studied 
the problem has found enormous variations among faculty members 
in the per cent of A’s, B’s, and C’s that they give. The same situation
holds for any set of categories, numbers, letters, or adjectives, that 
may be used. Standards of interpretation are highly subjective and 
vary widely from one rater to another. One man's “outstanding” is 
another man’s “satisfactory." 

Raters differ not only in the level of ratings that they assign, but also 
in how much they spread out their ratings. Some raters are con- 
servative, and rarely rate anyone very high or very low; others tend 
to go to extremes. This difference in variability of ratings serves also 
to reduce the comparability of ratings from one rater to another. 
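
The two kinds of rater difference described above, level and spread, can be made concrete with a small sketch. The three raters and their ratings below are invented purely for illustration; the level of a rater is taken here as the mean of his ratings and the spread as their standard deviation, which is one reasonable choice among several.

```python
from statistics import mean, pstdev


def rater_profile(ratings):
    """Summarize one rater as (level, spread) = (mean, standard deviation)."""
    return mean(ratings), pstdev(ratings)


# Hypothetical ratings of the same six ratees on a 1-5 scale (invented data).
ratings_by_rater = {
    "generous": [4, 5, 4, 5, 5, 4],      # high level, narrow spread
    "conservative": [3, 3, 3, 2, 3, 4],  # middling level, narrow spread
    "extreme": [1, 5, 1, 5, 3, 3],       # middling level, wide spread
}

profiles = {name: rater_profile(r) for name, r in ratings_by_rater.items()}

for name, (level, spread) in profiles.items():
    print(f"{name:12s} level={level:.2f} spread={spread:.2f}")
```

A rating of 4 is unremarkable from the first rater and exceptional from the second, which is why raw ratings from different raters are not directly comparable.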




Specific Rater Idiosyncrasies. Not only do raters differ in general
“toughness” or “softness.” They also differ in a host of specific
idiosyncrasies. The experiences of life have built up in each of us an
assortment of likes and dislikes and an assortment of individualized
interpretations of the characteristics of people. You may distrust anyone
who does not look at you while he is talking to you. Your neighbor
may consider any man a sissy who has a voice pitched higher than
usual. Your boss may consider that a firm handshake is the guarantee
of a strong character. Your golf partner may be convinced that blonds
are flighty. These are rather definite reactions that may be explicit
and clearly verbalized by the person in question. But there are myriad
other more vague and less tangible biases that we carry with us and
that influence our ratings. These biases help to form our impression
of a person and color all aspects of our reaction to him. They enter
into our ratings too. In some cases, our rating of one or two traits
may be affected. But often the bias is one of general liking for or
aversion to the person, and this generalized reaction colors all our
specific ratings. Thus, the ratings reflect not only the general subjective
rating standard of the rater, but also his specific biases with respect
to the person being rated.


THE OUTCOME OF FACTORS LIMITING RATING EFFECTIVENESS


What is the net result of these factors affecting the raters’ willingness
to rate conscientiously and ability to rate accurately? The effects show
up in certain pervasive distortions of the ratings, in relatively low
reliabilities, and in doubt as to the basic validity of rating procedures.

The Generosity Error. We have pointed out that the rater is often
as much committed to the people he is rating as he is to the agency
for which ratings are being prepared. Over and above this, there
seems to be a widespread unwillingness on the part of raters to damn
a fellow man with a low rating. The net result is that ratings tend
quite generally to pile up at the high end of any scale. The unspoken
philosophy of the rater seems to be “one man is as good as the next,
if not a little better,” so that “average” becomes in practice not the
mid-point of a set of ratings but near the lower end of the group. One
finds quite generally the paradox of a great majority of the group being
rated above average.

If the generosity error operated uniformly for all raters, it would
not be particularly disturbing. We would merely have to remember
that ratings cannot be interpreted in terms of their verbal labels and
that “average” means “low” and “very good” means “average.”

Makers of rating scales have countered this humane tendency with
some success by having several steps on their scale on the plus side
of average, so that there is room for differentiation without having
to get disagreeable and call a person “average.”

It is differences between raters in the degree of their "generosity 
error” that are more troublesome. To correct for such differences is 
a good deal more of a problem. We shall consider presently some 
special techniques that have been developed for that purpose. 

The Halo Error. Limitations in our experience with the person 
being rated, lack of opportunity to observe the specific qualities that 
are called for in the rating instrument, and the influence of personal 
biases that affect our general liking for the person all conspire to pro- 
duce another type of error in our ratings. This is a tendency to rate 
in terms of over-all general impression without differentiating specific 
aspects, of allowing our total reaction to the person to color our judg- 
ment of each specific trait. This is called “halo.” 

We can illustrate halo by a set of data on embryo airplane commanders
in World War II. Students were rated by their instructors for such
qualities as “eagerness,” “foresight,” “leadership,” “instrument flying,”
“formation flying,” “lead crew potentiality,” and “over-all value.” The
correlation between two raters for the same attribute was, on the average,
about .60. This serves as a measure of the reliability of the ratings.
We may speak of it as the between-raters reliability. The average
correlation between different attributes for the same rater was about .75.
That is, the correlation between ratings of different qualities was higher
than the reliability of the separate ratings. This consistency can only
be accounted for by a general halo that made instructor A’s appraisal
of student B much the same no matter what attribute was being rated.
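
The comparison used in this illustration can be sketched in a few lines. The ratings below are invented and are not the wartime data; the sketch only shows the two correlations being compared, one between two raters on the same trait and one between two traits for the same rater. The function and variable names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of six students on a 1-5 scale (invented data).
rater_a_leadership = [3, 4, 2, 5, 3, 4]
rater_a_foresight = [3, 4, 2, 5, 4, 4]
rater_b_leadership = [4, 3, 2, 4, 3, 5]

# Same trait, different raters: the between-raters reliability.
between_raters = pearson(rater_a_leadership, rater_b_leadership)
# Different traits, same rater: agreement that may reflect halo.
between_traits = pearson(rater_a_leadership, rater_a_foresight)

# Halo is suspected when one rater's traits agree with each other more
# closely than two raters agree about a single trait.
halo_suspected = between_traits > between_raters
print(round(between_raters, 2), round(between_traits, 2), halo_suspected)
```

With these invented figures the single rater agrees with himself across traits more closely than the two raters agree on one trait, which is the pattern the text attributes to halo.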

Of course, some relationship among desirable traits is to be expected.
We find correlation among different abilities when these are
tested by objective tests and do not speak of the halo effect that produces
a correlation between verbal and mechanical ability. Just how
much of the relationship between the different qualities on which we 
get ratings is genuine and how much of it is spurious halo is very hard 
to determine. That some of the relationship is due to inability to free 
oneself from general biases seems clear, however, from examples such 
as the one we have just given. 

Reliability of Ratings. Studies have shown repeatedly that the
between-raters reliability of conventional rating procedures is low.
Symonds, writing in 1931, summarized a number of studies and concluded
that the correlation between the ratings given by two independent
raters for the conventional type of rating scale is about .55. There
seems to be no good reason to change this conclusion after the lapse
of years. When the two ratings are uncontaminated, i.e., the raters
have not talked over the persons to be rated, and where the usual type
of numerical or graphic rating is used, the resulting appraisal shows
only this very limited consistency from rater to rater.

If it is possible to pool the ratings of a number of independent raters
who know the persons being rated about equally well, reliability of
the appraisal can be substantially increased. Studies have shown
that pooling ratings functions in the same way as lengthening a test,
and that the Spearman-Brown formula (p. 179) can legitimately be
applied in estimating the reliability of pooled independent ratings.
Thus, if the reliability of one rater is represented by a correlation of
.55, we have the following estimates for the reliability of pooled
ratings:


     2 raters    .71
     3 raters    .79
     5 raters    .86
    10 raters    .92
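
These estimates can be reproduced with a short sketch of the Spearman-Brown formula in its usual general form, r_k = k r / (1 + (k - 1) r), where r is the reliability of a single rater and k is the number of raters pooled. The function name is chosen here for illustration.

```python
def spearman_brown(single_rater_r, n_raters):
    """Estimated reliability of the pooled ratings of n_raters equally
    qualified, independent raters, given the reliability of one rater."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


# Reliability of a single rater as given in the text.
single_rater_r = 0.55

pooled = {k: round(spearman_brown(single_rater_r, k), 2) for k in (2, 3, 5, 10)}

for k, r in pooled.items():
    print(f"{k:2d} raters  {r:.2f}")
```

The gain from each added rater shrinks as the pool grows: going from one rater to two adds .16, while going from five to ten adds only .06.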


Unfortunately, in many important practical situations it is impossible 
to get additional equally qualified raters. An elementary-school pupil 
has only one regular classroom teacher; a worker has only one immediate
supervisor. Adding on other raters who have limited acquaintance
with the ratee may weaken rather than strengthen the ratings.

Reliability data on some of the newer types of rating devices to be
discussed presently appear somewhat more promising. These data will
be presented as the methods are discussed. One of the gains from
basing ratings on specific tangible behaviors will be, it is hoped, that 
the objectivity, and hence the reliability, of the judgments will be in- 
creased. 

Validity of Ratings. All the limiting and distorting factors that we 
have been considering make us doubtful about the validity of ratings. 
Rater biases and rater unreliability operate to lower validity. However,
it is usually very difficult to make any statistical test of the validity
of ratings. The very fact that we have fallen back on ratings usually
means that no better measure of the quality in question is available 
to us. There is usually nothing else against which we can test the 
ratings. 

In one context, the validity of ratings is axiomatic. If we are interested
in appraising how a person is reacted to by other people, i.e.,
whether a child is well liked by his classmates or a foreman by his 
work crew, ratings are the reactions of these other persons and are 
directly relevant to the point at issue. 




When ratings are being studied as predictors, statistical data can
be obtained as to the accuracy [illegible].
This is something that must be determined in each setting and for each
type of criterion that is being predicted. [illegible]
the most valid available predictors is [illegible] studies of the ratings
[illegible] Military [illegible]. Ratings by tactical officers and by fellow cadets
correlated more highly with later ratings of [illegible]
than did any other aspect of the man’s record. Correlations
with ratings of effectiveness in [illegible]
about .50. This criterion is again a rating, but it is probably as close
to the real “pay off” as we are likely to get in this situation. In other
situations, of course, ratings may turn out to [illegible].
Each type of situation must be studied for its own sake.

IMPROVING THE EFFECTIVENESS OF RATINGS 

So far we have painted a rather gloomy picture of ratings
as devices for appraising personality. It is true that there are
hazards and pitfalls in rating procedures. But in spite of their
limitations, there are and will continue to be situations in
which we will have to rely on the judgments of other people as a means
of appraising our fellow men. The sincerity and [illegible] of a
potential medical student, the social acceptability of a [illegible], and
the conscientiousness of a private secretary can probably only be
evaluated through the judgment that someone makes of these qualities
in the individuals in question. What can be done, then, to [illegible]
the defects of rating procedures? We shall consider first the [illegible]
of the rating instrument and then the planning and conduct of the
ratings.


REFINEMENTS IN THE RATING INSTRUMENT

The usual rating instrument has two main components: (1) a set
of stimulus variables (the qualities to be rated) and (2) a pattern of
response options (the ratings that can be given). In the simplest and
most conventional rating forms, the stimulus variables consist of trait
names and the response options consist of numerical or adjectival
categories. Such a form was illustrated on p. 353. This type of form
appears to encourage most of the shortcomings that we have been
discussing in the preceding section. Consequently, many variations and
refinements of format have been tried out in an attempt to overcome
or at least minimize these shortcomings. The variations have manipulated
the stimulus variables, the response options, or both. Some of
the main variations are described below.

REFINEMENTS IN PRESENTING THE STIMULUS VARIABLES

Bare trait names represent unsatisfactory stimuli for a rater for two
reasons. In the first place, as we pointed out on p. 359, the words
mean different things to different people. The child who shows
“initiative” to teacher A may show “insubordination” to teacher B, whereas
teacher B’s “good citizen” may seem to teacher A a “docile
conformist.” In the second place, the terms are quite abstract and far
removed in many cases from the realm of observable behavior.
Consider “adjustment,” for example. We do not observe a child’s
adjustment. We observe a host of reactions to situations and people. Some
of these reactions are perhaps symptomatic of poor adjustment. But
the judgment about the child’s adjustment is several steps removed
from what we have a chance to observe.

Workers with ratings have striven to get greater uniformity of meaning
in the traits to be rated, and they have attempted to base the ratings
more closely upon observable behavior. These attempts have
modified the stimulus aspect of rating instruments in three ways.

1. Trait Names Have Been Defined. A phrase, sentence, or several
sentences have been appended to each trait name to give it greater
uniformity of meaning. Thus, we might have:

Citizenship. Participation in school projects. Willingness to do his
share. Responsibility for work and property.

This represents a somewhat more objective and behavioral statement 
and should produce at least some more uniformity in meaning among 
a group of raters. However, we may doubt that a brief verbal definition
will completely overcome the individual differences in meaning
that different raters bring to the task.

2. Trait Names Have Been Replaced by Several More Concrete and
Limited Descriptive Phrases. Thus, the abstract and blanket term
“citizenship” might be broken down into the several components
suggested above, i.e.:

Participation in school projects.
Willingness to do his share.
Responsibility for completing work.
Carefulness with school property.

A judgment would now be called for with respect to each of the more
limited and more concrete aspects of pupil behavior.




3. Each Trait Name Has Been Replaced by a Substantial Number
of Descriptions of Specific Behaviors. This carries the move toward
concreteness and specificity one step farther. Following out our analysis
of “citizenship,” we might replace it with a set of behaviors somewhat
as follows:

a. Works well with other children in groups and committees.
b. Brings materials to school.
c. Does his work without complaining.
d. Gets assigned work in on time.
e. Keeps desk and work area neat.
f. Uses materials without wasting.
g. Works steadily, even when not watched.
h. When one task is done, finds other work to do.
i. Takes care of school property.

This list is still more tangible and specific. There should be relatively 
little opportunity, in each case, for ambiguity as to what it is that is 
being observed and reported on. 

The replacement of one general term with many specific behaviors
gives promise of achieving more uniformity of meaning from one rater
to another. It may also bring the ratings in closer touch with actual
observations that have been made of the behavior of the individual
who is being appraised. Where the trait to be rated is one that the
rater has really had no opportunity to observe, the attempt to replace
the trait name with specific observable behaviors will often make this
fact painfully apparent and will force the designer of the instrument
to rethink the problem of relating his instrument to the observations
that the rater has really had an opportunity to make.

The gains that a list of specific behaviors achieves in uniformity of 
meaning and concreteness of behavior judged are not without cost. 
The cost lies in the greatly increased length and complexity of the rat- 
ing instrument. There are limits to the number of different judgments 
that can be asked of a rater. Furthermore, the lengthy, analytical re- 
port of behavior may be confusing to the person who tries to use and 
interpret it. The lengthy list of specific behaviors will probably prove
most effective when (1) judgments are in very simple terms, such as
simply present-absent and (2) there are provisions for organizing and 
summarizing the specific judgments into one or more scores for broad 
areas. 

REFINEMENTS IN FORM OF RESPONSE CATEGORIES

Expressing judgments about a ratee by selecting some one of a set
of numbers, letters, or adjectives is still common on school report cards
or in civil service and industrial merit rating systems. However, these
procedures have little other than simplicity to commend them. As we
saw on p. 360, the categories are arbitrary and undefined. No two
raters interpret them in exactly the same way. A rating of “superior”
may be given to 5 per cent of employees by one supervisor and to 25
per cent by another. One man’s A is another man’s B. Subjective
standards reign supreme.

Various attempts have been made to manipulate the response op- 
tions to try to achieve a more meaningful scale or greater uniformity 
from rater to rater. 

1. Percentage of Group. To try to produce greater uniformity from 
rater to rater and to produce greater discrimination among the ratings 
given by a particular rater, judgments are sometimes called for in terms 
of percentage of a particular defined group. Thus, the professor rating 
an applicant for a fellowship is instructed to rate each candidate ac- 
cording to the following scale: 


Falls in the top 3 per cent of students at his level of training.
In top 10 per cent, but not in top 3 per cent.
In top 25 per cent, but not in top 10 per cent.
In top half, but not in top 25 per cent.
In lower half of students at his level of training.

Presumably, the specified percentages of a defined group provide a
uniform standard of quality for different raters. However, the stratagem
is usually only partially successful. Individual differences in
generosity are not that easily suppressed.
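
The arithmetic implied by this scale can be sketched as follows. If the scale were applied exactly as written, the five categories would hold 3, 7, 15, 25, and 50 per cent of the defined group, and a rater's actual distribution could be set beside these figures. The function names and the representation of a candidate's standing as a percentage from the top are assumptions made for illustration.

```python
# Upper bound of each category, as a percentage counted from the top of the group.
CUTOFFS = [
    (3, "top 3 per cent"),
    (10, "top 10 per cent, but not top 3 per cent"),
    (25, "top 25 per cent, but not top 10 per cent"),
    (50, "top half, but not top 25 per cent"),
    (100, "lower half"),
]


def scale_category(percent_from_top):
    """Return the scale category for a standing such as 8 (8th per cent from the top)."""
    if not 0 < percent_from_top <= 100:
        raise ValueError("percent_from_top must be greater than 0 and at most 100")
    for cutoff, label in CUTOFFS:
        if percent_from_top <= cutoff:
            return label


def implied_shares():
    """Per cent of the defined group that each category should contain."""
    shares, previous = {}, 0
    for cutoff, label in CUTOFFS:
        shares[label] = cutoff - previous
        previous = cutoff
    return shares


print(scale_category(8))
print(list(implied_shares().values()))
```

A rater who places a third of his candidates in the top category, where the scale allows 3 per cent, is showing the generosity that the defined percentages were meant to suppress.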

2. Graphic Scale. A second variation is more a matter of form
than clarity of definition. Rating scales are often prepared so that
judgments may be recorded as a check at some appropriate point
on a line, instead of by choosing a number, letter, or adjective. For
example:

[Graphic scale: “Responsibility for Completing Work,” a horizontal line with one end labeled “high”]

The pattern often makes a fairly attractive page layout, is compact
and economical of space, and seems somewhat less forbidding than a
form which is all print. However, this particular variation does not
seem to have much advantage other than attractiveness and convenience.

3. Behavioral Statement. We have seen that the stimuli may be
in the form of relatively precise behavioral statements. Statements
of this sort may also be used to present the choice alternatives. Thus,
we may have an item of this type.

[Graphic scale: “Participation in School Projects,” with three behavioral statements along the line; only the fragment “works overtime” is legible]

In this case, three statements describing behavior are placed along
a graphic scale, and are used to define three points on it. Such
descriptions may be expected to lend more uniformity
of meaning to the scale steps. However, these provisions
do not completely overcome rater idiosyncrasies.

4. Man-to-Man Scales. An early attempt to introduce [illegible]
of meaning into the response scale used the names of
men instead of numbers, adjectives, or [illegible] to represent the
scale points. The rater is asked to think of someone he has known
well who was very high on the quality being rated. This person’s
name is then entered on the rating form to define the “very high” point
on the scale. In the same way, the names of other persons known
by the rater are entered in spaces to define “high,” “average,” “low,”
and “very low.” The five names then define levels for the trait. When
a person is to be rated, the rater is instructed to compare him with the
five persons defining the levels on the trait. The rater is to decide
which man he most closely resembles on the trait in question. He is
assigned the value corresponding to the step on the scale which that
man occupies.

It was thought that the man-to-man feature would lend concreteness
to the comparisons and overcome the tendency of some raters to be
consistently generous. In cases in which all raters have a wide scope
of acquaintance, so that their scale persons may be expected to be
fairly comparable, the procedure may make for more uniformity from
rater to rater. But such scope of acquaintance and thoroughness of
familiarity with suitable scale persons is likely to be somewhat rare
in the practical situations in which ratings must be made. Though
comparison with other persons is involved in any rating enterprise,
explicit use of particular persons to define the steps on a rating scale
has not been widely adopted.

5. Present-Absent. When a large number of specific behavior
statements are used as the stimuli, the response that is called for is
often a mere checking of those that apply to the individual in question.
The person is then characterized by the statements that are
checked as representing him. The rating scale becomes a behavior
check list. The set of items on p. 366 might constitute part of such
a check list.


If this type of appraisal procedure is to yield a score, the
statements must be scaled or assigned score values in some way. The
simplest way is merely to score them +1, -1, or 0, depending upon
whether they are favorable, unfavorable, or neutral with respect to
a particular attribute (i.e., perseverance, integrity, reliability,
etc.) or a particular criterion (i.e., success in academic work,
success on a job, responsiveness to therapy, etc.). An individual's
score can then be the sum of the scores for the items checked for
him.
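
The simple +1, -1, 0 scoring just described can be sketched in a few
lines. The statements and their keys below are hypothetical; they are
not taken from any published check list and serve only to show the
arithmetic.

```python
# Hypothetical check-list statements keyed as favorable (+1),
# unfavorable (-1), or neutral (0) with respect to "perseverance".
ITEM_KEYS = {
    "finishes assigned work": 1,
    "gives up at first difficulty": -1,
    "sits near the window": 0,
    "returns to a task after interruption": 1,
}

def checklist_score(checked_items, keys=ITEM_KEYS):
    """Sum the +1 / -1 / 0 keys of the statements checked for one person."""
    return sum(keys[item] for item in checked_items)

checked = [
    "finishes assigned work",
    "gives up at first difficulty",
    "returns to a task after interruption",
]
print(checklist_score(checked))
```

Statements that are not checked contribute nothing, so the score
depends both on which statements apply and on how many are checked.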

If the additional elegance seems justified, more refined scaling
procedures can be applied to the statements. Scale values can be
based on their judged significance or the degree to which they had
actually discriminated between successful and unsuccessful
individuals. The score an individual receives is then based on an
averaging of the scale values of the items that were checked as
describing him. The reliability of such a check list of scaled items
has been found to be quite satisfactory in some instances,
Richardson and Kuder reporting a correlation of .83 between two
independent raters of groups of salesmen.
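
As a rough illustration of scoring by averaged scale values, the
sketch below assumes that each statement has already been given a
scale value. The statements and values are invented for the example.

```python
from statistics import mean

# Hypothetical scale values (higher = more favorable), as might be
# obtained from judges or from item-discrimination data.
SCALE_VALUES = {
    "opens new accounts regularly": 8.5,
    "follows up on complaints": 7.0,
    "misses appointments": 2.0,
    "needs frequent prodding": 1.5,
}

def scaled_checklist_score(checked_items, scale_values=SCALE_VALUES):
    """Average the scale values of the statements checked for one person.

    Returns None when nothing was checked, since no average exists.
    """
    values = [scale_values[item] for item in checked_items]
    return mean(values) if values else None

print(scaled_checklist_score(["follows up on complaints", "misses appointments"]))
```

Because the score is an average rather than a sum, it does not rise
merely because a rater checks many statements.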


Only limited use has been made of check lists as devices to yield
scores on each individual, but they seem to present a promising
pattern. They come the closest of any of the rating procedures to
self-report inventories on the one hand and to ability tests on the
other. A behavior check list is in a sense a personality inventory
that has been filled out by someone other than the person being
described. The items can be selected and scored in much the same
way. The resemblance to an ability test can be seen in one
well-known behavior check list, the Vineland Social Maturity Scale.
This check list is made up of items relating to self-help,
self-direction, communication, socialization, and the like. Selected
items from different levels of the scale are shown in Table 13.1.

Norms for the scale were established for each item, representing
the age at which the behavior appears on the average. The check
list is filled out by a rater who knows the child being appraised.
Items the person does or can do are checked. A basal age is
established, below which all items are positive, and the person
being rated is given credit for all earlier items. Points are given
for additional items passed. The table of norms gives developmental
age equivalents for the point scores, and a developmental quotient
may be computed that indicates the individual's rate of progress
toward self-sufficiency and independence.

Table 13.1. Items Selected from the Vineland Social Maturity Scale

Item No.   Age Level    Item
           (in years)

    1        0-1        “Crows,” laughs
    6        0-1        Reaches for nearby objects
   11        0-1        Drinks from cup assisted
   15        0-1        Stands alone
   19        1-2        Marks with pencil or crayon
   28        1-2        Eats with spoon
   34        1-2        Talks in short sentences
   37        2-3        Removes coat or dress
   40        2-3        Dries own hands
   44        2-3        Relates experiences
   51        4-5        Cares for self at toilet
   53        4-5        Goes about neighborhood unattended
   68        7-8        Disavows literal Santa Claus
   70        7-8        Combs or brushes hair
   78       10-11       Writes occasional short letters
   80       10-11       Does small remunerative work

The check-list pattern has been used as a simple descriptive
instrument, as in school reports to the home. The procedure is
attractive in this setting because it can give information on
specific aspects of pupil development. However, forms tend to become
complicated and to confuse many parents, so this type of reporting
has not been widely adopted.

6. Frequency of Occurrence, or Typicality. Instead of reacting in
an all-or-none fashion to an item, as in the check list, response
can be qualified as being “always,” “usually,” “sometimes,”
“seldom,” or “never” characteristic of the ratee. Or the ratee may
be characterized as “very much like,” “a good deal like,” “somewhat
like,” “slightly like,” or “not at all like” the behavior described
in the statement. (The terms of frequency or resemblance may vary;
the ones given are only suggestive.) An individual's score would now
take account of the significance of the statement and the point on
the scale that was checked. That is, an important attribute would
receive heavier credit than a minor one, and a check at the “always”
step more credit than a check at “usually.”
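
One way such a score might be assembled is shown in the sketch below.
The step credits and the statement weights are assumptions made for
the example; the text does not prescribe particular values.

```python
# Hypothetical credits for the frequency steps named in the text.
STEP_CREDIT = {"always": 4, "usually": 3, "sometimes": 2, "seldom": 1, "never": 0}

def frequency_score(responses, weights):
    """Weighted sum over statements.

    responses: statement -> frequency step checked by the rater
    weights:   statement -> importance weight of that statement
    """
    return sum(weights[s] * STEP_CREDIT[step] for s, step in responses.items())

# Hypothetical statements; the first is treated as twice as important.
weights = {"helps classmates": 2, "keeps desk tidy": 1}
responses = {"helps classmates": "always", "keeps desk tidy": "usually"}
print(frequency_score(responses, weights))
```

With these values an “always” on the heavily weighted statement
counts for more than an “always” on the minor one, which is the
property the paragraph describes.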

Indefinite designations of frequency or degree of the sort that are
being discussed here will be differently interpreted by different
raters, so the old problem of differences in rater standards is
still with us. Moreover, when the number of specific behaviors being
checked is substantial, a simple present-absent checking correlates
quite highly with the more elaborate form.

7. Ranking. In those cases in which each rater knows a substantial
number of ratees, he may be asked to place them in rank order with
respect to each attribute being studied. Thus, a teacher may be
asked to indicate the child who is most outstanding for contributing
to the class projects and activities “over and beyond the call of
duty,” the one who is second, and so on. Usually, the ranker will be
instructed to start at both ends and work in toward the middle,
since the extreme cases are usually easier to discriminate than the
large group of average ones in the middle. In order to ease the task
of the ranker, tie ranks may be permitted. If no tie ranks are
permitted, the ranker may feel that the task is an unreasonable one,
especially in a group of some size.

Ranking is an arduous task for the ranker, but it does achieve two
important objectives. It forces the person doing the evaluation to
make discriminations among those being evaluated. The ranker cannot
place all or most of the persons being judged in a single category,
as may happen with other reporting systems. Secondly, it washes out
individual differences among raters in generosity or leniency. No
matter how kindly the ranker may feel, he must put somebody last,
and no matter how hard-boiled he is, someone must come first.
Individual differences in standards of judgment are eliminated from
the final score.

If scores based on ranking by different judges are to be combined,
there is one assumption that is introduced in rankings that may be
about as troublesome as the individual differences in judging
standards that have been eliminated. If we are to treat rankings by
different judges as comparable scores, we must assume that the
quality of the group ranked by each was the same. That is, we assume
that being second in a group of twenty represents the same level on
the trait being appraised, whichever group of twenty it happened to
be. Usually we do not have any direct way of comparing the different
subgroups, so about all we can do is assume that they are
comparable. If the groups are fairly sizable and chosen more or less
at random from the same sort of population, this is a reasonable
assumption for scores based on ranks.

Ranks as such do not represent a very useful type of score. Their
meaning depends upon the size of the group: being third in a group
of three is very different from being third in a much larger group.
Furthermore, steps of rank do not represent equal units of a trait.
As we saw in our discussion of percentile norms (Chapter 6), with a
normal, bell-shaped distribution, one or two ranks at the extremes
of a group represent much more of a difference than the same number
of ranks near the middle of the group. For that reason, it is common
practice to convert ranks into normalized standard scores, a type of
score that has uniform meaning without regard to the size of the
group and uniform units throughout the score range. Tables have been
prepared to facilitate this conversion. Tables for groups of all
sizes up to twenty-five may be found in Symonds (ref. 17).
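
The conversion of ranks to normalized standard scores can be sketched
as follows. This is not a reproduction of the published tables. It
assumes one common convention, namely that a person at rank R in a
group of N stands above (N - R + 0.5)/N of the group, and it reports
the result on an arbitrary scale with mean 50 and standard deviation
10.

```python
from statistics import NormalDist

def rank_to_z(rank, group_size):
    """Convert a rank (1 = highest) to a normal-curve z score.

    Assumes the mid-rank convention: the person is taken to stand
    above (group_size - rank + 0.5) / group_size of the group.
    """
    if not 1 <= rank <= group_size:
        raise ValueError("rank must lie between 1 and group_size")
    proportion_below = (group_size - rank + 0.5) / group_size
    return NormalDist().inv_cdf(proportion_below)

def rank_to_standard_score(rank, group_size, mean=50.0, sd=10.0):
    """Express the z score on a scale with the chosen mean and SD."""
    return mean + sd * rank_to_z(rank, group_size)

# Third in a group of three versus third in a group of twenty-five.
print(round(rank_to_standard_score(3, 3), 1))
print(round(rank_to_standard_score(3, 25), 1))
```

Under this convention, one step of rank near the top of a group of
twenty-five covers a larger distance on the score scale than one step
near the middle, which is the point made in the paragraph above.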

THE “FORCED-CHOICE” PATTERN

All the variations considered so far operated on the same basic
pattern. The rater considered one attribute at a time and assigned
the ratee to one of a set of categories or placed him relative to
others on that particular attribute. We shall now consider a major
departure from that pattern. The essence of the procedure we
consider now is that the rater considers a set of attributes at one
time and decides which one (or ones) most accurately represents the
person being rated. Thus, an instrument developed for evaluating Air
Force technical school instructors included sets of items such as
the following:

a. Patient with slow learners. 

b. Lectures with confidence. 

c. Keeps interest and attention of class. 

d. Acquaints classes with objective for each lesson.

The rater’s assignment was to pick out the two items from the set that 
were most descriptive of the person being rated. 

Note that all the statements in the above set are nice things to say
about an instructor. As a matter of fact, they were carefully
matched, on the basis of information from a preliminary
investigation, to be just about equally nice to say about an
instructor. But they differ a good deal, again based on preliminary
investigations, in the extent to which they actually distinguish
between persons who have been identified on other evidence as being
good and poor instructors. The most discriminating statement is (a)
and the least discriminating is (b). A credit of 2 is therefore
assigned to statement (a), 1 to (c) and (d), and 0 to (b). A
person's score for the set would be the sum of the credits for the
two items marked as most descriptive of him. His score for the whole
instrument would be the sum of his scores for 25 or 30 such blocks
of four statements. Such a score was found to have good split-half
reliability (.85 to .90), so that this instrument provided a
reliable score for the individual's desirability as an instructor in
the eyes of a single rater. This does not, of course, tell anything
about the agreement that would be found between different raters.
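
The scoring of a forced-choice block can be sketched directly from
the credits given above (2 for statement a, 1 for c and d, 0 for b).
The function names and the second block of credits are illustrative
assumptions only.

```python
# Credits for the sample block in the text: (a)=2, (c)=1, (d)=1, (b)=0.
SAMPLE_BLOCK_CREDITS = {"a": 2, "b": 0, "c": 1, "d": 1}

def block_score(chosen, credits=SAMPLE_BLOCK_CREDITS):
    """Score one block: sum the credits of the two statements chosen."""
    if len(set(chosen)) != 2:
        raise ValueError("exactly two different statements must be chosen")
    return sum(credits[letter] for letter in chosen)

def instrument_score(blocks):
    """Sum block scores; blocks is a list of (chosen, credits) pairs."""
    return sum(block_score(chosen, credits) for chosen, credits in blocks)

print(block_score(("a", "c")))   # most discriminating item plus one other
print(block_score(("b", "d")))   # least discriminating item plus one other

# A hypothetical two-block instrument; the second block's credits are invented.
other_credits = {"a": 1, "b": 2, "c": 0, "d": 1}
print(instrument_score([(("a", "c"), SAMPLE_BLOCK_CREDITS),
                        (("b", "c"), other_credits)]))
```

Since the rater sees only four equally favorable statements, the
credits stay hidden from him; only the scorer needs the key.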

By casting the evaluation instrument into a forced-choice format, 
the maker hopes to accomplish three things: 


1. He hopes to eliminate variation in rater standards of generosity
or kindliness. Since the items in a set are all equally favorable
things to say about a person, the kindly soul should have no
particular tendency to choose one rather than another, and the true
nature of the ratee should be the controlling factor.

2. He hopes to minimize the possibility of a rater intentionally
biasing the score. In the ordinary rating scale, the rater is in
pretty complete control of the situation. He can rate a man up or
down as he pleases. In the forced-choice type of instrument, it is
hoped that the rater will be unable to identify which are the
significant choices and that therefore he will be unable to throw
the score one way or the other at will. However, though there are
some indications that a forced-choice instrument is less fakeable
than an ordinary rating scale, it is still far from tamper-proof in
the hands of a determined rater.

3. He hopes to produce a better spread of scores and a more nearly
normal distribution of ratings. By making all options equally
attractive, one minimizes the effect of the generosity error, it is
hoped, and gets a more symmetrical spread of scores. Again, there is
indication that this result is achieved at least in part.


Forced-choice rating instruments are a relatively new development,
dating from World War II, though the forced selection of one of a
set of alternates had been used before that time in self-report
inventories. The close similarity in the pattern of these
forced-choice ratings to self-report instruments such as the Kuder
Preference Record and the Edwards Personal Preference Schedule
should be apparent. Because of the relative novelty of the
forced-choice pattern, evaluation of its usefulness in merit rating
procedures and in personality appraisal is still incomplete. This
format does appear to get away from some of the most troublesome
limitations of conventional rating procedures.



However, it has some limitations of its own. It tends to produce
rater resistance because of the judgments that the rater is called
upon to make. When the choice is of the type “Is this worker more
stupid or more lazy?” the instrument has a good deal of the “Have
you stopped beating your wife?” flavor. And even the judgment as to
whether a person is more intelligent or more industrious is not easy
to make, since there often seems to be no basis for comparing two
quite different qualities. Furthermore, the score from this type of
instrument may serve as a summary label or as a good predictor, but
it is of little help in providing a descriptive picture and an
understanding of the individual.

Developmental and exploratory work with forced-choice instruments
continues. For example, a form used by the Standard Oil Company of
New Jersey as a report on managerial performance combines
forced-choice with numerical rating. A set of items would appear as
follows:

                                 Fits poorly                Fits well

Follows work schedule closely     0  1  2  3  4  5  6  7  8  9
Has good work habits              0  1  2  3  4  5  6  7  8  9
Is a credit to his department     0  1  2  3  4  5  6  7  8  9
Makes decisions promptly          0  1  2  3  4  5  6  7  8  9

The numerical scale runs from a low of 0 to a high of 9. The rater
may use any part of the scale, with the one restriction that he may
not use the same scale point for two statements. Thus, he can rate
a man relatively low on all or relatively high on all. This takes
some of the onus out of the forced ranking so far as the rater is
concerned. In using the results, we may treat them either as
conventional ratings, paying attention to the level checked, or as
pure forced-choice rankings, ignoring the numerical values
completely.
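
The two ways of treating such a block can be illustrated with a short
sketch. The function names are hypothetical, and the check simply
enforces the one restriction stated above, that no scale point may be
used twice within a set.

```python
def validate_block(ratings):
    """ratings: statement -> scale point (0 to 9), all points distinct."""
    points = list(ratings.values())
    if any(not (0 <= p <= 9) for p in points):
        raise ValueError("scale points must lie between 0 and 9")
    if len(set(points)) != len(points):
        raise ValueError("the same scale point may not be used twice")

def as_forced_choice_ranks(ratings):
    """Ignore the numerical level and keep only the order (1 = fits best)."""
    validate_block(ratings)
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return {statement: rank for rank, statement in enumerate(ordered, start=1)}

low_block = {"schedule": 0, "habits": 1, "credit": 2, "decisions": 3}
high_block = {"schedule": 6, "habits": 7, "credit": 8, "decisions": 9}

# Read as conventional ratings, the two men differ greatly in level;
# read as forced-choice rankings, they are identical.
print(as_forced_choice_ranks(low_block) == as_forced_choice_ranks(high_block))
```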


REFINEMENTS IN THE RATING PROCEDURES

The best-designed instrument cannot give good results if used under
unsatisfactory rating conditions. Raters cannot give information
they do not have and cannot be made to give information they are
unwilling to give. We must, therefore, try to pick raters who have
had close contacts with the ratees and ask them for judgments on
attributes they have had an opportunity to observe. We should give
them some guidance and training in the type of judgments we expect
them to make, and if possible they should have opportunity to
observe the ratees after they have been educated in the use of the
ratings. When there are a number of persons who know the ratee,
ratings should be gathered from all of them and pooled. Every effort
should be made to motivate the raters to do an honest and
conscientious job. Let us consider these points further.

Selection of Raters. For most purposes, the ideal rater is the
person who has had a great deal of opportunity to observe the person
being rated in situations in which he would be likely to show the
qualities on which ratings are desired. (Occasionally it may be
desirable to get a rating of the impression which a person makes on
brief contact or in a limited experimental situation.) It is also
desirable that the rater take an impartial attitude toward the
ratee. The desirability of these two qualities, thorough
acquaintance and impartiality, is generally recognized in the
abstract. However, the goals may be only partially realized in
practice.

Administrative considerations usually dictate that the rating and
evaluation function be assigned to the teacher in the school setting
and to the supervisor in a work setting. The relationship here is in
each case one of direct supervision. There is generally a continuing
and fairly close personal relationship. But the relationship is a
one-directional and partial one. The teacher or supervisor sees only
one side of the pupil or worker, the side that is turned toward the
“boss.”

Those qualities that a boss has a good chance to see, primarily
qualities of work performance, can probably be rated adequately by
the teacher or supervisor. Thus, in one study of airplane mechanics
it was found that the ratings by a pair of supervisors on “job
know-how” were as reliable as the pooled ratings by eight coworkers
in a plane maintenance crew and that the supervisors' pooled rating
correlated .53 with a written proficiency test, whereas the pooled
rating for the coworkers correlated only .43. However, those
qualities that show themselves primarily in relationships with peers
or subordinates will probably be evaluated more soundly by those
same peers and subordinates. The validity of the U. S. Military
Academy peer ratings described on p. 364 is a case in point.

The lack of agreement between supervisor and pupil ratings of
teachers is suggested in some of the following correlations from
different studies:

Pupils' rating of excellence versus principal's rating = .35
Pupils' rating of excellence versus composite of 5 judges
Mean pupil rating of effectiveness versus
Student versus administrator rating on general effectiveness
    School I    .50
    School II



A certain amount of overlap is found, but each point of view appears
to have a good deal of uniqueness. The bird's-eye view and the
worm's-eye view are not the same.

The choice of raters for applicants for jobs or fellowships deserves
attention from this point of view. In this setting, the applicant is
usually asked to supply a certain number of references or to have
rating forms filled out by a certain number of individuals. The
choice of individuals is usually left up to him, and we may assume
that he will choose persons he believes will rate him favorably. It
might be more satisfactory if the applicant were asked to supply the
names and addresses of persons who stood in particular relationships
to him and who should be able to supply relevant information, rather
than leaving the applicant free to pick his own endorsers. Thus, a
job applicant might be asked to give the names of his immediate
supervisors in his most recent jobs; a fellowship applicant, to give
the names of his major advisor and of any instructors with whom he
had taken two or more courses. Thus, we are shifting the
responsibility for determining who shall provide the ratings from
the applicant to the agency using the ratings. Such a shift should
reduce the amount of special pleading for the applicant.

Selection of Qualities to Be Rated. Two principles seem to apply
in determining the types of information to be sought through rating
procedures. In the first place, it seems undesirable to use rating
procedures to get information that can be provided by some other
more objective and reliable indicator. Score on an intelligence test
is a better indicator of intellectual ability than a supervisor's
rating of intellect. When accurate production records exist, they
are to be preferred to a supervisor's rating for productivity.
Ratings are something to which we resort when we do not have a
better indicator available.

Secondly, we should limit ratings to relatively overt qualities,
ones that can be expressed in terms of actual observable behavior.
We cannot expect the rater to look inside the ratee and tell us what
goes on within. Furthermore, we must bear in mind the extent and
nature of the contact between rater and person rated. For example, a
set of ratings to be used after a single interview should be limited
to qualities that can be observed in an interview. The interviewee's
neatness, composure, manner of speech, and fluency in answering
questions are qualities that are observable in a single interview.
His industry, integrity, initiative, and ingenuity are not, though
these might be appraised with some accuracy by the person who has
worked with him for a time. Ratings should be of observable behavior
observable in the setting in which the man has been observed.



Educational Program for Raters. Good ratings do not just happen,
even with the proper raters and a proper instrument for recording
the ratings. Raters must be sold on the importance of making good
ratings and taught how to use the rating instrument. Pointing out
the importance of “selling” a rating program is easier than telling
how to do it. As we have indicated, inertia and identification with
the ratee on the part of the rater are powerful competing motives.
We cannot provide a course in salesmanship here, but a job of
selling needs to be done in any program for gathering ratings.
Furthermore, the selling must be a continuing one if the
thoughtfulness and integrity of the appraisals are to be maintained.

It is desirable that raters have practice with the rating
instrument. A training session, in which the instrument is used
under supervision, is often desirable. The meaning of the attributes
can be discussed, sample ratings prepared, and the resulting ratings
reviewed. The prevailing generosity error can be noted, and raters
cautioned to avoid it. Raters may be encouraged to attempt to
generate a more symmetrical distribution of ratings. Training
sessions will not eliminate all the shortcomings of ratings, but
they should reduce somewhat the more common distortions.

To get away from dependence on impressions recalled at the time of
rating, it may be well in planning a rating program to give the
raters advance notice, calling their attention to the qualities that
are to be rated. An attempt has even been made to have raters record
observations over a period of time. However, this calls for a high
level of commitment to, and cooperation with, the rating program.
Where that level of involvement exists, advance notice and
systematic observation should improve the ratings. Situations of
this sort are probably rare.

Pooling of Independent Raters. One of the limitations of ratings is
low reliability. In those situations in which there are a number of
persons who have had approximately equal opportunity to observe the
ratee, it may be possible to get independent ratings from each
potential rater and to pool these into a composite rating. The
formula for the reliability of pooled independent ratings is
essentially the same as the one given in Chapter 7 (p. 107). By
pooling we can, in theory, achieve any needed level of reliability
in our appraisal merely by increasing the number of raters.

The catch is found in the phrase “persons who have had approximately
equal opportunity to observe the ratee.” Unfortunately, the number
of persons who have had good opportunity to observe a person in some
particular setting, school, job, camp, etc., is usually limited.
Often only one person has been in close contact with the ratee in a
particular relationship. He has had only one homeroom teacher, only
one foreman, only one tent counselor. Others have had some contact
with him, but it may be so much less that their judgments add little
to the judgment of the rater most intimately involved.
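
The effect of pooling independent ratings can be worked through
numerically. The sketch below assumes that the formula referred to is
the familiar Spearman-Brown relation, in which the reliability of a
pool of n equally good independent raters is n*r / (1 + (n - 1)*r),
where r is the reliability of a single rater. The function names are
our own.

```python
def pooled_reliability(single_rater_r, n_raters):
    """Spearman-Brown estimate for a pool of n comparable, independent raters."""
    r = single_rater_r
    return n_raters * r / (1 + (n_raters - 1) * r)

def raters_needed(single_rater_r, target_r):
    """Smallest number of raters whose pooled reliability reaches target_r."""
    if not (0 < single_rater_r < 1 and 0 < target_r < 1):
        raise ValueError("both reliabilities must lie strictly between 0 and 1")
    n = 1
    while pooled_reliability(single_rater_r, n) < target_r:
        n += 1
    return n

# With a single-rater reliability of .55, how does pooling help?
for n in (1, 2, 4, 8):
    print(n, round(pooled_reliability(0.55, n), 2))
print(raters_needed(0.55, 0.85))
```

The arithmetic shows why the limitation described above matters: the
gain depends on finding several raters who are all about equally well
placed to observe the ratee, and the formula says nothing about
raters with much less contact.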

Note that we specified the pooling of independent ratings. If the
ratings are independently made, the error components will be
independent and will tend to cancel out. If, however, the judgments
are combined through some sort of conference or discussion
procedure, we do not know just what may happen. Errors may cancel
out, or the prejudices of the most dogmatic may prevail. Pooling
independent judgments is the only sure way of balancing out
individual errors and has been found in several studies to be more
satisfactory than the conference type of procedure.

NOMINATING TECHNIQUES

If a teacher is to understand pupils, he must have some
understanding of the values and standards that the group sets for
its members (the peer culture) and of the role that each child plays
in the group of his contemporaries (the peer group). The standards
and values of the peers provide the sanctions and the rewards that
are very important in determining how a person will act and how
content he will be in the group setting. The peer group can be quite
a cohesive unit. In such a group any action by a teacher with
respect to an individual child is often viewed not only as an action
for or against him but also as an action for or against the group to
which he belongs and which identifies with him. Thus, in order both
to understand the individual and to understand how acts with respect
to individuals affect the group climate, it is important to appraise
the role of the individual in the group.

It is far from easy for the teacher or other outsider to get an
accurate appraisal of group structure and of the place of the
individual in it. The child's role is likely to be seen only from an
adult point of view, and that adult viewpoint to be projected upon
the group of his contemporaries. Thus, when a child is helpful,
friendly, and generally acceptable to the teacher, the teacher is
likely to attribute to that child a level of influence with other
children that he does not have. It is often difficult for the
teacher to attribute to an active and troublesome child his true
level of influence with his peers. Teachers are often only dimly
aware of the pattern of social interplay in their classroom, the
reputation of each pupil among his peers, the factors determining
prestige in the peer group, the patterns of attraction and
repulsion, or the individual social aspirations.

In the understanding of these relationships, peer ratings are often
helpful. A rating procedure that is very simple and quite effective
for obtaining appraisals by peers is the nominating technique. We
will consider this technique first as applied to social choices and
rejections and then as applied more generally to trait ratings.

To improve their understanding of the social structure in a
classroom, the patterns of friendship and leadership, teachers may
use the simple expedient of asking pupils to name their choices of
best friends or of work partners. For example, a teacher might say
to a class: “For our unit on Mexico, we are going to need some
committees of children who will work together on some part of the
project. I would like to know which children you would like to have
on a committee with you. Put your name on the top of the piece of
paper I gave you. Then under it put the names of the children you
would especially like to have on your committee.”

We now have a series of nominations or choices for work partners.
It is possible to show these choices pictorially by a diagram such
as that shown in Fig. 13.1. This is called a sociogram and the
procedure of constructing a sociogram is called sociometry.

Procedures to help in the construction of sociograms can be found
in Moreno and in a booklet by the staff of the Horace Mann-Lincoln
Institute.

From the sociogram shown in Fig. 13.1, we see that A and B are
the most sought after members of the group: these are the stars.
Pupils J and O did not choose anyone and were not chosen by any
other pupils: they are isolates. Pupils H and I chose each other but
were not chosen by any other pupils. Except for the mutual
friendship between them, they too are isolates. Pupils P, Q, M, and
N are fringers: they do not really belong to any of the groups but
do make choices within the group.
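
The bookkeeping behind a sociogram can be sketched in code. The
pupils and choices below are invented and do not reproduce Fig. 13.1.
The category "unchosen but choosing" is only a rough stand-in for the
fringer, since deciding whether a pupil really belongs to a clique
calls for inspection of the whole diagram.

```python
from collections import Counter

def analyze_choices(choices):
    """choices maps each pupil to the list of classmates he or she named."""
    received = Counter(name for named in choices.values() for name in named)
    pupils = set(choices) | set(received)
    # A mutual choice is a pair of different pupils who named each other.
    mutual = {frozenset((a, b)) for a in choices for b in choices[a]
              if a != b and a in choices.get(b, [])}
    top = max(received.values(), default=0)
    return {
        "received": {p: received[p] for p in sorted(pupils)},
        "stars": sorted(p for p in pupils if top > 0 and received[p] == top),
        "isolates": sorted(p for p in pupils
                           if received[p] == 0 and not choices.get(p)),
        "unchosen_but_choosing": sorted(p for p in pupils
                                        if received[p] == 0 and choices.get(p)),
        "mutual_choices": sorted(tuple(sorted(pair)) for pair in mutual),
    }

# Hypothetical nominations for committee partners.
choices = {
    "Ann": ["Bea", "Cal"],
    "Bea": ["Ann"],
    "Cal": ["Ann", "Bea"],
    "Dov": ["Ann"],
    "Eli": [],
}
for key, value in analyze_choices(choices).items():
    print(key, value)
```

A tally of this kind gives the counts from which a diagram is drawn;
the picture itself is still needed to see how the cliques are linked.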

Figure 13.1 shows the pattern of choices and attractions within the
group. It would also be possible to have children indicate those
class members whom they would definitely not want in their group.
Calling for rejections presents some slight risks to individual and
class morale but does permit a more complete picture of group
structure.

In the sociogram in Fig. 13.1, the rather large number of isolates
and fringers and the linkages across from one “clique” to the other
suggest an unstable pattern which is in the process of changing and
reforming. Thus, the sociogram might represent a class at the
beginning of the school year, in which a residue of last year's
friendships is mixed with new currents and in which pupils from
other class groups and other schools are not yet integrated into the
group. It is in such a setting as this that the teacher can be most
effective in bringing isolates into the group or promoting new
friendships.

Fig. 13.1. Sociogram of fourth-grade class. (Legend: Choice; Mutual
Choice.)

After the teacher has determined which children are without friends
or are relatively isolated in the group, he should try to find out
why this is the case. Sometimes the explanation may be very simple.
The child may be new to the group and have not yet had time to find
a place in it. The normal opportunities to get acquainted, furthered
by the teacher's efforts to bring out the new child's assets, may be
all that is required. The child may be older or younger than the
rest of the group, having friends in other classes or outside of
school. The child may not live near any of the other children in the
class. At other times, the reasons may be more subtle, and it may
take a good deal of discreet sleuthing for the teacher to find out
why Willie or Alice is not chosen by classmates.

When the reasons are understood, the teacher can often help to
remedy the situation. Sometimes the simple process of coaching the
child so that he develops competence in athletics may turn the
trick. The teacher can arrange seats so that a child is placed near
one for whom he expressed preference. Sometimes helping a child to
develop everyday social graces or to improve his personal appearance
is all that is needed to make him acceptable. If an isolate or
fringer has special mechanical or artistic skills, giving him an
opportunity to use these in class group activities may be effective.

In general, the teacher can help a child become integrated with and
accepted by his peer group by (1) providing opportunity for
developing friendly relations, (2) improving social skills, and (3)
building up a sense of accomplishment or competence.

Sociometric choices describe the present flow of interaction among 
children rather than indicating any strong and permanent emotional 
structuring. However, the structuring of a class group affects the 
general emotional climate of the classroom. In a class where there 
are many isolates or children who are “fringers,” i.e., not completely 
accepted by a clique, the morale of the group tends to be low and 
group planning and coordinated group action is made more difficult. 

It is also true that the teacher in dealing with one child is quite
frequently dealing with the clique to which the child belongs.

Sociograms frequently point up mistakes that a teacher makes in 
characterizing a child. Thus, when the teacher has judged a child 
and his position in his peer group by adult standards, sociometric 
devices point out these mistakes and give the teacher a framework for 
understanding behavior that taken by itself may seem unexplainable. 

Sociograms have been used in various non-school situations. In
industry they have been used to form work groups and have been 
found to stimulate production. They have been used in institutions, 
especially those for juvenile offenders, to select house groups. 

The sociogram by itself tells the teacher only what children are
selected or rejected, not the reasons for selection and rejection.
It is most useful when used in conjunction with good anecdotal
records. For successful use, especially when rejections are asked
for, there needs to be a friendly feeling between the teacher and
the class. Furthermore, the teacher should actually use the
nominations as far as possible in the way in which he has told the
class he would use them.

The teacher should also remember that group structure is not static,
especially in younger age groups. One sociogram made at the
beginning of a school year will rarely provide an adequate picture
of group structure through the year. Furthermore, neither choices
nor rejections can be taken entirely at face value. When, as is
sometimes the procedure, the number of choices is limited to a few,
a choice may reflect the prestige of the person chosen and a desire
to be associated with that prestige, rather than real friendship.

A final word of caution should be sounded about attempting to use
sociometric data to reconstruct a group. We have offered some
suggestions as to ways in which a teacher may try to help the
relatively isolated child. However, any such manipulations call for
a good deal of subtlety. Heavy-handed attempts by the teacher to
manipulate the pupils in the group may only aggravate the ills he is
trying to cure.

Other patterns for obtaining peer evaluations have been developed,
and they have been used for other purposes besides the preparing of
sociograms and the studying of social currents within the group. A
slightly more complex form is the Ohio Social Acceptance Scale, in
which each pupil reacts to each other pupil in the group, placing
him under one of the following six categories: (1) My very, very
best friends, (2) My other friends, (3) Not friends, but okay, (4)
Don't know them, (5) Don't care for them, (6) Dislike them. From
the pooled pupil responses, a score may be obtained for each child
indicating the extent of his acceptance within the group. This or
some other similar format provides a simple procedure for obtaining
ratings by a group of peers, and their simplicity makes them usable
even with elementary school children.

Nominations may be used at any age level, and may be made with 
respect to any type of characteristic. For example, they have fre- 
quently been used in the armed services in Officer Candidate School, 
where each member of a unit may be asked to nominate a specified 
number of individuals in his unit who have shown the greatest evidence 
of “leadership” during the training course. He may also be asked 
to nominate those who have shown the least indication of leadership. 
Taking all the nominations for the group as a whole, it is possible 
to arrive at a score for each individual, giving a plus for each favorable 
nomination and a minus for each unfavorable nomination. 
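
The plus-and-minus tally just described can be sketched in a few lines of Python. This is only an illustration: the function name `nomination_scores`, the trainee names, and the nominations themselves are hypothetical, and every nomination is assumed to carry equal weight.

```python
from collections import Counter


def nomination_scores(favorable, unfavorable):
    """Net score per person: +1 for each favorable nomination received,
    -1 for each unfavorable nomination received."""
    scores = Counter()
    for nominees in favorable.values():
        for name in nominees:
            scores[name] += 1
    for nominees in unfavorable.values():
        for name in nominees:
            scores[name] -= 1
    return dict(scores)


# Hypothetical unit of four trainees. Each key is the person doing the
# nominating; each value lists the people he named.
most_leadership = {
    "Adams": ["Baker", "Clark"],
    "Baker": ["Clark", "Davis"],
    "Clark": ["Baker", "Davis"],
    "Davis": ["Clark", "Baker"],
}
least_leadership = {
    "Adams": ["Davis"],
    "Baker": ["Adams"],
    "Clark": ["Adams"],
    "Davis": ["Adams"],
}

scores = nomination_scores(most_leadership, least_leadership)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, scores[name])
```

With these made-up nominations, Baker and Clark each receive a net score of 3, Davis 1, and Adams -3. A person named by no one simply does not appear in the result, which may itself be worth noting.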

A variation of the nominating procedure that has been used with 
school children has usually been referred to as the “Guess Who” tech- 
nique or as “Casting Characters.” In this procedure, the children are 
instructed somewhat as follows: 


Suppose we were going to put on a class play. The characters in 
the play are described below. For each character, you are to put 
down the names of one or more children in the class who would be 
good for that part because he or she is just like that anyway. 


“This person is always cheerful and happy, never grouchy or cross.” 
“This person is always butting in and telling other people how to do 
things. He cannot mind his own business.” 
“This person is very quiet and doesn't get into games or do things 
with other children.” 


The number of characters can be extended as desired. Each “char- 
acter” is a description in fairly concrete terms of a quality of behavior 
in which the investigator is interested. Descriptions of opposite ends 
of a scale can be included — i.e., friendly versus unfriendly, dominat- 
ing versus submissive, etc. — and can be treated as positive and nega- 
tive nominations on a single scale. Each child receives a score for 
each “character,” based on the number of nominations he receives. 

The attractive feature of the nominating pattern is its simplicity, 
which makes it rather painless to administer and usable with young 
groups or groups with little sophistication or experience in rating. It 
is feasible because the large number of raters makes it possible to use 
a simple count of nominations instead of a rating of the usual type. 
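
As a rough sketch of how “Guess Who” ballots might be tallied, the Python below counts nominations per character and then combines a pair of opposite characters into one scale. The ballots, the pupil names, the character labels (“cheerful” and “grouchy”), and the function name `guess_who_scores` are all hypothetical.

```python
from collections import Counter


def guess_who_scores(ballots, scales):
    """Tally "Guess Who" ballots.

    ballots: one dict per pupil, mapping a character label to the
             names that pupil wrote down for it.
    scales:  maps a scale name to a (positive, negative) pair of
             character labels describing opposite ends of the scale.
    Returns (counts per character, net score per scale).
    """
    counts = {}
    for ballot in ballots:
        for character, names in ballot.items():
            counts.setdefault(character, Counter()).update(names)

    net_scores = {}
    for scale, (positive, negative) in scales.items():
        net = Counter()
        for name, n in counts.get(positive, {}).items():
            net[name] += n  # nominations for the favorable description
        for name, n in counts.get(negative, {}).items():
            net[name] -= n  # nominations for the opposite description
        net_scores[scale] = dict(net)
    return counts, net_scores


# Three hypothetical ballots for two opposite "characters".
ballots = [
    {"cheerful": ["Ann", "Bob"], "grouchy": ["Carl"]},
    {"cheerful": ["Ann"], "grouchy": ["Carl", "Bob"]},
    {"cheerful": ["Ann", "Dora"], "grouchy": []},
]
counts, net_scores = guess_who_scores(
    ballots, {"cheerfulness": ("cheerful", "grouchy")}
)
print(net_scores["cheerfulness"])
```

Here Ann ends with a net score of 3, Dora 1, Bob 0 (one nomination at each end), and Carl -2. Keeping the per-character counts alongside the net score preserves the difference between a child nobody mentions and one who is named at both ends of the scale.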


SUMMARY AND EVALUATION 

In spite of all their limitations, evaluations of persons through rat- 
ings will undoubtedly continue to be widely used for administrative 
evaluations in schools, civil service, and industry, as well as in educa- 
tional and psychological research. We must recognize this fact and 
learn to live with it. Granting that we shall continue to use ratings 
of different aspects of personality, we should do so with full awareness 
of the limitations of our instruments, and we should do so in such a 
way that these limitations are minimized. 

The limitations of rating procedures arise out of: 

1. A humane unwillingness to make unfavorable judgments of our 
fellows, which is particularly pronounced when we identify to some 
extent with the person being rated (generosity error). 

2. Wide individual differences among raters in “humaneness” or, 
in any event, in leniency or severity of rating (differences in rater 
standards). 




3. A tendency to respond to other persons as a whole in terms of 
our general liking or aversion and difficulty in differentiating out spe- 
cific aspects of the individual personality (halo error). 

4. Limited contact between the rater and person being rated — 
limited both in amount and in type of situation in which seen. 

5. Ambiguity in meaning of the attributes to be appraised. 

6. The covert and unobservable nature of many of the inner aspects 
of personality dynamics. 

7. Instability and unreliability of human judgment. 

In view of these limitations it is suggested that ratings will provide 
a most accurate portrayal of the person being rated when: 

1. Appraisal is limited to those qualities that appear overtly in 
interpersonal relations. 

2. The qualities to be appraised are analyzed into concrete and 
relatively specific aspects of behavior, and judgments are made of 
these behaviors. 

3. A rating form is developed that forces the rater to discriminate 
and/or that has controls for rater differences in judging standards. 

4. Raters are used who have had the most opportunity to observe 
the individual in situations in which he would display the qualities to 
be rated. 

5. Raters are “sold” on the value of the ratings and trained in the 
use of the rating instrument. 

6. Independent ratings of several raters are pooled when there are 
several persons qualified to carry out ratings. 
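
A minimal sketch of the pooling recommended in point 6, assuming each rater independently assigns every person a number on the same scale and that pooling means taking a simple average. The rater labels, the names, and the 1-to-5 scale are hypothetical.

```python
from statistics import mean


def pool_ratings(ratings_by_rater):
    """Average the independent ratings each person received."""
    collected = {}
    for ratings in ratings_by_rater.values():
        for person, value in ratings.items():
            collected.setdefault(person, []).append(value)
    return {person: mean(values) for person, values in collected.items()}


# Hypothetical ratings on a 1-to-5 scale, made separately by three raters.
ratings = {
    "rater_1": {"Jones": 4, "Smith": 2},
    "rater_2": {"Jones": 5, "Smith": 3},
    "rater_3": {"Jones": 3, "Smith": 4},
}
pooled = pool_ratings(ratings)
print(pooled)
```

For these figures the pooled rating is 4 for Jones and 3 for Smith. A plain average does nothing about differences in rater leniency; that would call for the controls mentioned in point 3.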

Evaluation procedures in which the significance of his ratings is 
somewhat concealed from the rater present an interesting possibility 
for civil service and industrial use. This is true particularly when 
controls on rater bias are introduced through “forced-choice” tech- 
niques or a correction score. 

Peer-nominating techniques have interesting possibilities for use in 
schools and other group settings. They permit sociometric analyses 
of the interpersonal relations of pupils in a classroom or the workers 
in a shop. “Guess Who” nominations permit a simple type of rating 
in the early grades. 


REFERENCES 

1. Brookover, W. B., Person-person interaction between teachers and pu- 
pils and teaching effectiveness. J. educ. Res., 34, 1940, 272-287. 

2. Cook, W. W., and C. H. Leeds, Measuring the teaching personality. Educ. 
psychol. Meas., 7, 1947, 399-410. 

Guilford, J. P., Psychometric methods, 2nd ed., New York, McGraw-Hill, 


QUESTIONS FOR DISCUSSION 

1. If you were writing to someone ... 
... evaluation of the ... 
by different users? What is your impression of the reliability of the rat- 
ings? Of their freedom from halo and ... 

4. What factors influence a rater's willingness to rate conscientiously? 
How serious is this issue? What can ... 

5. Why would three independent ratings from separate raters ordinarily 
be preferable to a rating prepared by the three persons working together as ... 

6. In the personnel office of a large company, ... 
are called upon to rate job applicants at the ... 
of the following characteristics would you expect to be rated reasonably 
reliably? Why? 


a. Initiative. 

b. Appearance. 

c. Work background. 

d. Dependability. 

e. Emotional balance. 


7. In a small survey of the report cards used in a number of communi- 
ties the following four traits were most frequently mentioned on 
the report cards: (a) courteous, (b) cooperative, (c) health habits, (d) 
works with others. How might these be broken down or revised so that 
the classroom teacher could evaluate them better? 

8. Which of the following would influence your judgment of a person 
in an interview? In what way? 


a. A very firm grip in shaking hands. 

b. Wearing a “loud” necktie. 

c. Generally pausing for a moment before replying to a question. 

d. Playing with keys on a key ring. 

e. Having a spot on his vest. 

f. Looking at the floor all during the interview. 




9. Compare the reactions of several class members or of several ac- 
quaintances on the items of question 8. How general are the reactions? 
What basis in fact is there for them? 

10. What advantages do ratings by peers have over ratings by superiors? 
What disadvantages? 

11. What are the advantages of ranking over rating on a rating scale? 
What are the disadvantages? 

12. Suppose that a forced-choice rating scale had been developed for use 
in rating the teachers in a city school system in order to get an evaluation 
of their effectiveness. What advantages would this rating procedure have 
over other types of ratings? What problems would be likely to arise in 
using it? 

13. Make up a “Guess Who” form that might be useful to a teacher in 
finding out about the pupils in his class. If a class group is available to 
you, try the form out and analyze the results. What precautions should be 
taken in using the results? 

14. Using a class group taught by some class member or made available 
by the instructor, get each child's choices for other children to work on a 
committee with him. Plot the results in a sociogram. What do the results 
tell you about the class and the pupils in it? What limitations would this 
sociogram have for judging the status of an individual child among his 
classmates? 

15. Suppose you have been placed in charge of a merit rating plan which 
is being introduced in some company. What steps would you take to try 
to get as good ratings as possible? 



Chapter 14 

Behavioral Measures of 
Personality 


We have tended to define personality ... to the 
individual's behavior. It would be natural, ... 
behavior of the individual to get ... Two 
possibilities are available to us. We ... designed 
“test” situations, in which the individual's behavior may be scored or 
rated, or we may plan to observe his behavior ... 
in his natural environment. Each of these ... 
from psychologists and educators, and we shall consider each ... 

BEHAVIOR TESTS 

In personality testing we are concerned with the typical behavior 
of the individual, what he will do under the ordinary circumstances of 
life, rather than what he can do if he is trying to do his best. Under 
these circumstances, it is obvious that any test must usually be ... 
and disguised, so that the examinee does not know what is being 
appraised. This appears especially clearly in the field of character 
testing. 

Traits of character relate to behaviors in which society sets up 
definitions of what is “good” and what is “bad.” We can hardly expect 
a child to report his dishonesties, for example, or to show them in a 
test situation in which he knows his honesty is being observed and 
appraised. Furthermore, he has probably managed to conceal many 
of his transgressions from teacher, camp counselor, or other adult 
who might be asked to rate him. We are almost forced back upon 
a concealed test to elicit such socially disapproved behavior. We 
shall describe in some detail the honesty tests devised by May and 
Hartshorne for the Character Education Inquiry,* in part for their 
intrinsic interest and in part because they illustrate the virtues and 
many of the limitations of this type of measurement procedure. 




May and Hartshorne developed a comprehensive series of tests of 
honesty. These included situations in which the individual had a 
chance to cheat, situations in which he had an opportunity to lie, and 
situations in which it was possible for him to steal. Some of the situa- 
tions are described below. 

Situation A: Cheating on a test by copying. A test is given dealing 
with some topic related to school work, word knowledge, for example. 
The papers are collected. The next day the papers are passed out, 
and each pupil is allowed to score his own paper when the answers 
are read aloud. As a matter of fact, however, the papers have been 
accurately scored before they are returned without any marks being 
made on the paper. The amount that the pupil copies in and scores 
his own paper above the correct score is used as an indication of 
cheating. 

Situation B: Cheating on a test by adding on. A speeded arithmetic 
test is given, and at the end of 2 or 3 minutes pupils are told to stop 
work. However, for several minutes papers are left on their desks 
while the teacher or test administrator is busy doing something else. 
Later a second test is given after which the papers are immediately 
collected. When performance on the first testing surpasses perform- 
ance on the second test by a specified amount, this is taken as evidence 
that the examinee added onto his work after the time limit was up and 
before the papers were collected. 
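
The scoring logic of Situations A and B comes down to two simple comparisons, sketched below in Python. The function names, the sample scores, and the threshold of 8 points are assumptions made for illustration; the text says only that the first score must surpass the second “by a specified amount.”

```python
def copying_gain(self_score, true_score):
    """Situation A: points by which the pupil's self-scored paper exceeds
    the score recorded secretly before the paper was returned."""
    return self_score - true_score


def added_on(first_score, second_score, threshold):
    """Situation B: True when the unsupervised first test surpasses the
    immediately collected second test by at least `threshold` points."""
    return first_score - second_score >= threshold


# Hypothetical pupil records.
print(copying_gain(self_score=34, true_score=29))
print(added_on(first_score=42, second_score=30, threshold=8))
print(added_on(first_score=33, second_score=30, threshold=8))
```

For these invented figures the copying gain is 5 points; the second pupil (42 against 30) would be flagged as having added on, and the third (33 against 30) would not. Where the threshold is set decides how often honest pupils who merely had a better first trial are misclassified.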

Situation C: Cheating in a game by peeking. The game is illustrated 
in Fig. 14.1. The stunt is to shut one's eyes and put a dot in each 
circle in turn. Norms are prepared, based upon ... 
their view blocked so ... 
... unduly well, as determined by the peek-proof ..., is assumed to 
have peeked and helped himself. 

... record blank. Since fatigue tends to set in, ... it 
is unlikely that he will show improvement. If the ... he re- 
ports surpasses his practice squeezes by a ..., it is as- 
sumed that he has been unduly optimistic in ... 

Situation E: Lying, self-glorification. In this test the child is asked 
a series of questions. Each question has to do with standards of be- 
havior that are universally applauded but ... 
one question reads “Do you always obey your parents cheerfully and 
promptly?” and another, “Do you always ...?” 
It is hard to know how many of a set of statements like this a child 
might truthfully endorse, but an attempt was made to ... this 
by having groups of graduate students think back to their ... 
and respond as would have been true of them then. A child who 
marks an excessive number of items is deemed to be not angelic but 
untruthful. 

Situation F: Stealing. A game is devised which uses a number of 
coins. These are in a box, and one box is passed out to each child. 
After the game is over, each child is told to put the coins in 
the box and fasten it up. The boxes are collected. They have been 
unobtrusively coded, so it is possible to tell which child had which 
box. A check of the coins in the boxes makes it possible to determine 
which children have helped themselves to one or more of the coins. 

As can be seen from the brief descriptions, the tests are quite in- 
volved and require rather extensive stage-managing. The details of 
the testing situation seem fairly critical, i.e., how sure the child feels 
that he is free from observation, the manner in which the children are 
occupied when they are stopped in their work, and so forth. And it 
is crucial that the “security” of the test be maintained, for if the true 
purpose of the test were suspected, examinees could immediately con- 
form to the approved social standard. 





EVALUATION OF BEHAVIOR TESTS OF HONESTY 

How reliable and how valid are ... honesty ... 
Reliability estimates are ... 
reliabilities of single tests are ... about .50. In 
comparison with the aptitude and achievement tests we have been 
considering in ..., these reliabilities are disappoint- 
ing. The score of a pupil ... of the set used by May 
and Hartshorne would provide ... indication of the 
... The test would need to be ex... 
... different groups of pupils. 


Table 14.1. Reliabilities of Tests Used for Measuring Deception 

(From May and Hartshorne) 

                                            Reliability 
Type of Test                                Coefficient 

1. Copying ...                                  .70 
2. ...                                          .44 
3. Peeping when eyes should be ...              .46 
4. Faking a physical ability test               .50 
5. ...                                          .46 
6. Lying to win approval                        .84 
7. Getting illicit help ...                     .24 

to it to find any outside 

pupils may be w ^ Ptel 

classroom cheating before we 1°°^ . . honesty 

terion (n«rag= bow the difeent k.nds 

one type correlated ^ P, classroom eheab^ tes 

extent of .26. “"''’VT.esrthT average cor- 
related with chcattn= u„d with the stealing p„, aged 

was found to be 'lying 5 the t'vo 0“'-”'"'“"' 

S"'ctaerc.assroomtcsuana.06w 

room tests. 



Even though the ... 
tions between the different sorts ... versus gymna- 
... the correlations involve different settings ... 
tests are measuring different specific factors ... something about the 
... and character traits), and it is of limited ... 
patterns) in the individual that have developed in response to ... 
situation. Our characterization of a person in general terms is 
only partially effective in predicting what he will do in a given specific 
situation. 

So far as personality testing is concerned, we ... 
performance test behaves much like an item on an ability test or a 
personality inventory. Thus, in a sense, the test that permits a pupil 
to revise his answer sheet as he scores it really asks a question 
in behavioral terms, to wit: Would you change the ... 
scored your own paper? Each honesty test asks a ... 
question, and collectively they might be thought of as a twenty-item 
questionnaire on honesty. The reliabilities of the separate tests and 
their intercorrelations are not so different from those that we find for 
the single items of a personality questionnaire. If we think of the 
single tests as items, we will appreciate better the relatively low 
reliabilities and the quite considerable specificity that they show. 

The low correlations among specific honesty tests make it necessary 
to include a number of separate tests if we hope to get an adequate 
representation of different honesties. Because of this fact, together 
with the complexities of testing procedure, the use of behavior tests 
of character has been limited largely to research projects. They have 
not been adapted to any extent for routine use in schools or for any 
type of personnel selection. 
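
One way to see why several separate tests are needed is the Spearman-Brown formula, a standard psychometric result that this passage does not itself invoke. Applying it here assumes that the separate tests can be treated as roughly comparable items. The average intercorrelation of .25 used below is a hypothetical round figure, not a value taken from the May and Hartshorne data.

```python
def composite_reliability(r, k):
    """Spearman-Brown estimate of the reliability of a composite of k
    comparable tests whose average intercorrelation is r."""
    return k * r / (1 + (k - 1) * r)


average_r = 0.25  # hypothetical average correlation between single tests
for k in (1, 3, 9):
    print(k, round(composite_reliability(average_r, k), 2))
```

Under these assumptions a single test has a reliability of .25, a composite of three tests reaches .50, and it takes nine tests to reach .75, which suggests why a battery of this kind is burdensome for routine use.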


SELECTED RESULTS FROM MAY-HARTSHORNE STUDIES 
As research tools, behavior tests have provided a wealth of interest- 
ing data, notably in the original studies of May and Hartshorne. Some 
of the more interesting findings from these studies are summarized 
below. Readers are referred to the original studies for details. 

1. Honesty was essentially unrelated to age or sex over the range 
of grades studied. There was no tendency for children to learn to be 
more honest as they got older. 

2. The more intelligent children received higher honesty scores. 
Of course, school pressures were probably less severe for brighter 
children. How much the difference in behavior reflects a difference 
in motivational pressures cannot be determined. 

3. Honesty was associated with socio-economic status, children 
from higher socio-economic levels evidencing less dishonesty than 
those from lower levels. 

4. Siblings resembled one another in honesty, and this resemblance 
was more than could be accounted for by familial resemblances in 
intelligence or by the common socio-economic background. 

5. Children in a school following progressive educational practices 
cheated less than comparable children in a conventional school pro- 
gram. 

6. The children within a school as a whole or a class group within 
a school tended to resemble one another in level of honesty displayed. 
There appeared to be a factor of school or class morale. 

7. There was no indication that children who participated in or- 
ganized programs of religious education or who were members of 
groups expressing character education aims were more honest than 
non-participants or non-members. 


OTHER TYPES OF PERFORMANCE TESTS 

A number of psychologists have recently been exploring indirect per- 
formance measures as indicators of personality variables. Eysenck * 
has developed a battery of performance measures to predict neuroti- 
cism. Some of the measures that have tended to discriminate between 
normal and neurotic groups are (1) the amount of body sway in re- 
sponse to a direct suggestion of falling, (2) the number of unusual 
responses on a multiple-choice free association test, (3) the ... 
of dark adaptation, (4) the number of food aversions, and (5) the 
length of time breath could be held. From a battery of eight or ten 
such specific tests Eysenck was able to get quite high reliability and 
fairly sharp discrimination between a normal and a neurotic group. 
Cattell * has attempted to develop batteries of ... 
to appraise the same personality factors that he had ... per- 
sonality rating and inventory procedures. These approaches ... 
... indirect, objective testing to appraise personality ... 
primarily an area for further research. 

SITUATIONAL TESTS AND ASSESSMENT PROGRAMS 
During and since World War II a number of programs 
have been set up for making a comprehensive appraisal ... 
for a particular type of training or for ... 
publicized of these was the program ... 
the Office of Strategic Services during World War II. ... 
has been fully described,* and some features of it are worth con- 
sidering here. Assessment programs have generally used a 
wide variety of techniques for evaluating the individual. They have 
included ability tests of several sorts, detailed interviews, ... 
types of fantasy and projective materials. ... 
... has been the situational test in which the ... 
in a more or less standardized task situation where his behavior can 
be observed, his responses recorded, or various aspects of his ... 
rated by observers. 


SITUATIONAL TESTS IN THE OSS ASSESSMENT PROGRAM 
For assessment by the OSS staff, each candidate was brought to 
an assessment center for a 3-day period of testing and evaluation. 
During this period he was continuously under observation and was 
subjected to a wide range of tests and stresses. In addition to ... 
tests of a number of kinds (tests of intelligence, mechanical ability, 
ability to observe and remember details) he was exposed to a num- 
ber of “situational” tests. These consisted of staged situations, with 
fairly complete instructions and ground rules, presenting problems 
that the candidate was to solve, either individually or as a member of 
a group. The variety of situational tests used in the program was 
wide. Selected examples are described briefly in the following para- 
graphs.* 

* For fuller descriptions and verbatim instructions, the reader is referred to 
The Assessment of Men.* 

The Brook. Individuals worked in teams composed of five or six 
men. The group was brought to a stream about 8 feet wide. On the 
... pulley, and other 
items. They were instructed somewhat as follows: 


In this problem you have to use your imagination. Before you you 
see a raging torrent so deep and so fast that it is quite impossible to 
rest anything upon the bottom of the stream. The banks are sheer, 
so it will be impossible for you to work except from the top of them. 

You are on a mission in the field, and having come to this brook 
you are faced with the task of transporting this delicate range-finder, 
skillfully camouflaged as a log, to the far bank, and of bringing that 
box of percussion caps, camouflaged as a rock, to this side. In carry- 
ing out this assignment, you may make use of any materials you find 
around here. When the job is done, all of you, as well as any material 
you have used, are to be back on this side. 

The limits within which you are to work are marked by the two 
white stakes on each bank (the stakes were approximately 15 feet 
apart), and you are not permitted to jump across the stream. 


The behavior of each man in the group was observed, as the group 
went about solving the task, and each man was rated on such factors 
as energy and initiative, effective intelligence, social relations, leader- 
ship, and physical ability. 

Construction. A single candidate was presented with the task of 
building a structure, using materials resembling oversized Tinker Toy. 
His task was described to him, and then he was told that the test was 
primarily one of leadership, since the work was to be done by two 
helpers whom he was to supervise. The “helpers” were called in, 
and the construction project began. However, the “helpers,” who 
were carefully coached assistants, turned out to be sensitive, stupid, 
and obstructive, and their behavior ranged from “gold-bricking” to 
systematically heckling their supervisor. Again, the responses of the 
examinee to these frustrations were observed, and he was rated for 
emotional stability and leadership. 

Improvisations. This test was one of role-playing. Working in 
pairs, examinees were assigned roles in a dramatic situation and were 
told to enact the scene as they would handle it in real life. Thus, 
one situation was set as follows: 


A moved to a small city about 3 months ago and opened a business 
there. He has been doing quite well and one ... 
application for membership in a club in the town. He has heard noth- 
ing in response to this application and goes to the home of B, a promi- 
nent member of the club, with whom he is pleasantly acquainted. [A 
is then sent out of the room and B is told that A has received several 
blackballs. A is then called back into the room.] 



As before, aspects of each individual's ... 
themselves in the role-playing situations, were rated by ... 

Stress Interview. Candidates were instructed to assume that the 
following situation had occurred. 

A night watchman at 9:00 P.M. found you going ... 
papers in a file marked “SECRET” in a government office in ... 
... 
papers whatsoever with you. The night watchman has brought you 
here for questioning. 

The examinee was given 12 minutes to prepare a ... 
account for his presence in the compromising situation. Then he was 
subjected to an intensive and grueling interrogation in which his state- 
ments were questioned, inconsistencies brought out, and every effort 
made to trip him up and to make him feel foolish. ... 
on the quality of his story and his ability to maintain it and upon ... 
evidence of emotional stability. 

Further examples of situational tests might be cited, but these serve 
to show the essential characteristics of this type of approach to per- 
sonality appraisal. The attempt is made to develop ... 
approach realistic lifelike situations but still permit a ... 
amount of control from person to person. The OSS staff ... 
desirable characteristics of situations to be that they (1) have a num- 
ber of alternative solutions, (2) do not require highly specialized 
abilities, (3) reveal kinds of behavior that cannot be registered by 
mechanical means, (4) force the candidate to reveal dominant dis- 
positions of his personality, (5) involve interaction with other persons, 
and (6) require the coordination of numerous components of per- 
sonality. 


LEADERLESS GROUP DISCUSSION 

One procedure that provides a somewhat simplified version of the 
situational test, and consequently one that is more widely adaptable 
for practical use, is the “leaderless group discussion.” This approach 
has been used when a number of individuals are to be appraised for 
some type of administrative or executive position, such as a school 
principalship. The candidates are assembled in small groups, a group 
of about six apparently working best. The group is assigned a topic 
to discuss or a problem to solve relevant to their background and the 
position for which they are candidates. They are allowed a substan- 
tial block of time, perhaps an hour, to carry on their discussion. 
During that time they are observed by a team of observers and a ... 
... have still not been com- 
pletely explored. 

EVALUATION OF SITUATIONAL TESTS 

Situational tests like ... used 
in the OSS differ from the ... character tests in that, 
though they still deal with behavior in a somewhat disguised situation, 
they do not yield an actual ... product. Thus, in the May- 
Hartshorne stealing test, it ... count the coins left 
in the box to determine ... the examinee ...; the tests were highly 
objective as far as ... Situational tests are 
not objective. Though ... is made to present a relatively 
standard task situation, ... of each examinee's behavior 
is through the observations ... of the staff of examiners. 
The gain from this approach, ... the loss in objectivity, 
is a great increase in the range ... be studied. Much 
that the individual does, especially in his relations with others, leav... 
behavior of this sort and to provid... 

-’SLtional tests appear ..^J^t^rSedme.ure 

types of sooiai -d c"™":!; test. H-'even ;h^7 PS^nal 
ment by any rnor 1 „ involving a n ( ^ilities and 

number of probl^s A ^ b, costly in 

tests is costly. The te almost certain , .jjjnration 

r r » j-'t 0rsianS: 

skill on the part of the '• 



... maintaining the uniformity of the situations from individual to in- 
dividual and from group to group. ... pre- 
venting leakage of information about ... so that the task 
is a novel one to each group as it is approached by each 
group with the same background. ... difficulties 
involved, it is not surprising that the use of situational tests has been 
limited to rather elaborate assessment programs, aiming at the selec- 
tion of special types of personnel: undercover agents, clinical psy- 
chology trainees, or executives and administrators. 

The actual value of situational tests and, in fact, ... 
elaborate assessment program remains somewhat ... Psy- 
chologists who have participated in the ... 
cases, enthusiastic about the procedures. ... 
that is elicited has real value in predicting important facts about the 
individual is another matter. In the OSS program, ... 
to obtain only a limited amount of evidence on the extent to which 
men who had gone through the assessment program ... 
in their job assignments. Ratings from overseas colleagues and evalu- 
ations by commanding officers were obtained in a fraction of the 
cases. Predictions of success did correlate significantly with ... 
on the job. The evaluation that showed the highest correlation was the 
rating for effective intelligence. The final rating for effective intelli- 
gence based on the complete 3-day program had a somewhat higher 
correlation with rated success on the job than did scores based on 
a brief objective test of verbal ability, but the difference was not great. 
In another extensive program of situational testing, designed for 
the selection of clinical psychology trainees,* there was no evidence 
that the addition of situational tests improved prediction beyond what 
was possible from the individual's credential file, selected objective tests, 
and a personal autobiography. In summary then, the situational type 
of test is an interesting additional tool for personality assessment. It 
seems to provide a direct opportunity to see the individual functioning 
in lifelike situations and thus to appraise a variety of aspects of leader- 
ship, cooperation, and social functioning. However, evidence for the 
value of the results as improving our prediction of the individual's suc- 
cess on the job is largely lacking. Because its practical value has not 
been demonstrated and because the techniques are costly in preparations 
required and in the time of testing personnel, situational testing must 
be considered a subject for research at the present time, rather than 
a proven tool for personnel evaluation. 



SYSTEMATIC OBSERVATION 

The situational test has introduced us to observation as a technique 
for studying the typical behavior of the individual. Observation in 
that instance was of what he did in specified test situations. We turn 
now to observation in the naturally occurring situations of everyday 
life. The situations of everyday life are probably less uniform from 
person to person than the test situations that we stage. Also, they 
are not loaded to bring forth the behaviors in which we are specially 
interested. However, the very naturalness of real life events and the 
fact that we do not have to stage special events just for testing pur- 
poses make observation of natural situations appealing to us. 

Of course, we observe the people with whom we associate every 
day of our lives, noticing what they do and reacting to the ways in 
which they behave. Our impressions of people are continuously be- 
ing formed and modified by our observations of them. But these ob- 
servations are casual, unsystematic, and undirected. If we are asked 
to document with specific instances our judgment that John is a leader 
in his group or that Henry is undependable, we are usually put to it 
to provide more than one or two concrete observations of actual be- 
havior to document our general impression. Observations must be 
organized, directed, and systematic if they are to yield dependable 
information about an individual. 

We should perhaps pause to draw a distinction between the observational 
procedures that we discuss now and the rating procedures that 
we considered in Chapter 13. The basic distinction is this: when we 
are collecting observations, we want the observer to function as nearly 
as possible as an objective and mechanical recording instrument, 
whereas when we gather ratings we want the rater to synthesize and 
integrate the evidence that he has. The one function is purely that 
of providing an accurate record of the number of social contacts, 
suggestions, aggressive acts, or whatever the category of behavior may 
be in which we are interested. The observer serves merely as a somewhat 
more flexible and versatile camera or recording machine. In 
rating, by contrast, the human instrument must judge, weigh, and 
interpret. 

Systematic observational procedures have been most fully developed 
in connection with studies of young children. They seem particularly 
appropriate in this setting. On the one hand, the young 
child has not developed the covers and camouflages to conceal him- 




procedures have had their fullest development. 

STEPS TO IMPROVE OBSERVATIONAL PROCEDURES 

Many of the early studies of young children ... 
development of a particular child or of two or ... 
on observations by a psychologist parent. These ... 
descriptive background for understanding the young ... 
were qualitative and lacking in precision. Careful ... 
child or investigations to determine the effect of ... 
environments or experiences require that we ... 
he shows negativism and resistance, for example, but also ... 
or how often. The needs of measurement, as distinct from 
qualitative description, require observational procedures that permit 
a statement of quantity, of amount. The procedure should be 
as objective and reliable as possible, with a minimum of dependence 
upon the whims and idiosyncrasies of the individual observer. To 
accomplish this, several precautions are typically undertaken, which 
are discussed below. 

1. Selecting the Aspect of Behavior to Be Observed. One ... 
of the general observer of human behavior is that he does not know 
what he is looking for. So much is happening in any situation involving 
one or more active human beings that some part of it must 
inevitably be missed. We cannot notice everything that happens, and 
we cannot record everything that we notice. In any program of systematic 
observation, we must first select certain aspects or categories 
of behavior to be observed. Thus, in a study of nursery-school pupils, 
we may be interested in aggressive behavior and may limit ourselves 
to instances of aggressiveness. In a research project to evaluate a 
school program, we may be interested in observing evidences of cooperation 
or of independently initiated activity and may restrict our 
observation to these. 

2. Defining the Behaviors That Fall Within a Category. If ... 
two observers loose without further ado to observe the occurrence of 
“aggressive acts” or “nervous behavior” in preschool children, we will 
find that there are many disagreements between them in the observations 
they make. Our categories must be further specified. They 
must become more behavioral if we are to get good agreement between 
observers. What is an ... 

we wish to include ... one's seat in the second? ... “ability to get and 
interpret data” ... index or making inferences from a bar chart ... 
translate “aggressive acts” into hitting, kicking, biting, ... name-calling, 
and the like. An advance agreement on what is to be included, based upon 
prior studies of the domain in question, is a necessary ... 
objective and reliable ... 

3. Training Observers. Even ... carefully defined set of behaviors 
to be ... Some of these are unavoidable due to fluctuations of attention 
or variation of scoring on close ... however, be eliminated 
by training. Practice ... or more observers make 
records of the same ... compare notes, discuss ... 
provide one means of increasing ... categories. 

... observations of some ... 

... child's behavior ... aggressive acts ... In this case one often has 
difficulty ... kicks him ... the next one begins. ... flow over from one 
to ... a number of cases ... 

An expedient ... period of observation ... has been to break ... than a ... 
the occurrence ... Then the observation ... of behavior during each ... 
in length. ... particular category ... 5-minute periods might ... 
small segment of ... different day. ... each of the 5-minute periods, 
each ... particular child did ... aggressiveness. Such scores, based ... 
indicating the extent ... 




on an adequate number of ... 

... observed in watching a child ... account of what was observed is 
only ... observations are recorded immediately. There is so much to 
see ... so much like others that to rely upon memory ... 
after-the-fact account of a child's behavior is fatal. ... 
the case in any attempt at complete and ..., though 
we shall find a place for selective observation ... anecdotal recording 
of significant incidents of behavior some time after they have taken 
place. 

Any program of systematic observation must ... 
some technique for immediate and ... 
that are observed. There are many possibilities for facilitating 
recording of behavior observations. One that has been ... 
been to develop a systematic code for the ... 
are of interest. Thus, preliminary observations will have served to 
define the range of aggressive acts that can be expected from ... 
4-year-olds. Part of the code might be set up as follows: ... 
p = “pushes,” s = “grabs materials away from,” n = “calls a 
name,” and so forth. A record blank can be prepared, ... 
to represent the time segments of the observations, and code entries 
can be made quickly while the child is observed almost without 
interruption. 
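
A code of this kind lends itself to simple tabulation. The following is a minimal sketch in Python, offered only as an illustration. The letters p, s, and n are those given in the text; the four-segment record for one child, the names CODE and tally_record, and the choice of summary figures are our own assumptions and not part of any published procedure.

```python
from collections import Counter

# Code letters as given in the text; a fuller code would add more entries.
CODE = {
    "p": "pushes",
    "s": "grabs materials away from",
    "n": "calls a name",
}


def tally_record(record):
    """Summarize one child's record blank (hypothetical layout).

    `record` holds one string per time segment; each string contains the
    code letters entered during that segment ("" if nothing was coded).
    Returns (count of each act, number of segments with any coded act).
    """
    counts = Counter()
    segments_with_act = 0
    for entry in record:
        unknown = set(entry) - set(CODE)
        if unknown:
            raise ValueError(f"unrecognized code letters: {sorted(unknown)}")
        counts.update(entry)  # a string updates the counter letter by letter
        if entry:
            segments_with_act += 1
    acts = {CODE[letter]: n for letter, n in counts.items()}
    return acts, segments_with_act


# Hypothetical record: four consecutive time segments for one child.
record = ["p", "", "pn", "s"]
acts, segments = tally_record(record)
print(acts)      # {'pushes': 2, 'calls a name': 1, 'grabs materials away from': 1}
print(segments)  # 3
```

Counting the segments in which any coded act occurred, rather than only the total number of acts, may give a score that depends less on how the observer divides a continuous stream of behavior into separate acts.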

If the observer is skilled in standard shorthand, of course, fuller 
notes of the observation can be taken. These can be transcribed and 
coded or scored later. In some cases, where a research ... 
liberal financial backing, more complete photographic or sound 
recordings of the observations may make possible a permanent record 
of the behaviors in a relatively complete form. These records can 
then be analyzed at a later date. Such resources are likely to be the 
exception, however, and in many cases it will be necessary to plan a 
simple and efficient code to provide an immediate and permanent record 
of what was observed. The important objectives here are to do 
away with dependence on memory, to get a record that will preserve 



